
arxiv: 2512.11089 · v4 · pith:EJGXZ26Wnew · submitted 2025-12-11 · 📊 stat.ML · cs.LG

TPV: Parameter Perturbations Through the Lens of Test Prediction Variance

Pith reviewed 2026-05-21 16:53 UTC · model grok-4.3

classification 📊 stat.ML cs.LG
keywords test prediction variance · parameter perturbations · overparameterized models · model robustness · pruning · label noise · benign overfitting

The pith

Training-set sensitivity to parameter perturbations converges to the test-set value in overparameterized models, independent of generalization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces test prediction variance, or TPV, as a label-free measure of how much a trained model's outputs change under small parameter perturbations. TPV unifies analysis across different perturbation sources including SGD noise, label noise, quantization, and pruning through a trace expression that separates the trained model's geometry from the perturbation details. The central result proves that in the overparameterized limit, the TPV computed solely on training inputs converges to the TPV on unseen test inputs, regardless of whether the model generalizes. This convergence enables practical uses such as a pruning criterion and training-based model selection signals. Empirically the pattern holds even at low widths.

Core claim

The paper proves that training-set TPV converges to its test-set counterpart in the overparameterized limit, irrespective of generalization performance. This supplies the first demonstration that prediction variance under local parameter perturbations can be inferred from training inputs alone. The trace form of TPV recovers the wide-minima hypothesis for SGD and quantization noise while producing a distinct Jacobian-spectral characterization for label noise that links to benign overfitting in nonlinear networks.

What carries the argument

Test prediction variance (TPV), the first-order sensitivity of trained model outputs to parameter perturbations, expressed in a trace form that isolates the Jacobian geometry of the trained model from the specific perturbation mechanism.
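
To make the trace form concrete, the sketch below evaluates Tr(J Σ J^T), the first-order expression quoted in the simulated rebuttal further down this page, and cross-checks it against sampled perturbations. It is only an illustration: the function names `tpv_trace` and `tpv_monte_carlo`, the array shapes, and the random stand-in Jacobians are assumptions made here, not the paper's code.

```python
import numpy as np

def tpv_trace(jacobians, sigma):
    """Mean over inputs of Tr(J(x) Sigma J(x)^T).

    jacobians: array of shape (n_inputs, n_outputs, n_params)
    sigma:     (n_params, n_params) second-moment matrix of the perturbation
    """
    per_input = np.einsum("nai,ij,naj->n", jacobians, sigma, jacobians)
    return float(per_input.mean())

def tpv_monte_carlo(jacobians, sigma, n_draws=20000, seed=0):
    """Same quantity, estimated by sampling perturbations delta ~ N(0, Sigma)."""
    rng = np.random.default_rng(seed)
    delta = rng.multivariate_normal(np.zeros(sigma.shape[0]), sigma, size=n_draws)
    # First-order output change J(x) @ delta for every draw and every input.
    change = np.einsum("nai,di->dna", jacobians, delta)
    return float((change ** 2).sum(axis=2).mean())

# Hypothetical stand-ins for the Jacobians of a trained model.
rng = np.random.default_rng(1)
J = rng.normal(size=(5, 3, 4))       # 5 inputs, 3 outputs, 4 parameters
A = rng.normal(size=(4, 4))
Sigma = A @ A.T / 4.0                # any symmetric positive semidefinite matrix

exact = tpv_trace(J, Sigma)
approx = tpv_monte_carlo(J, Sigma)
print(exact, approx)
```

The geometry (the Jacobians) and the perturbation (Σ) enter as separate arguments, which mirrors the separation the summary describes.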

If this is right

  • TPV recovers the wide-minima hypothesis for both SGD and quantization perturbations.
  • Label-noise TPV yields a Jacobian-spectral view that connects to benign overfitting.
  • A TPV-derived pruning criterion matches leading label-free baselines on standard tasks.
  • Training-set TPV provides a usable signal for model selection in both in-distribution and transfer settings.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The observed stability at low widths suggests TPV could serve as a cheap diagnostic even when models are not extremely overparameterized.
  • Because TPV separates geometry from perturbation type, the same machinery might apply to other post-training changes such as quantization-aware fine-tuning.
  • The link to benign overfitting raises the possibility that TPV could flag regimes where overparameterized networks remain robust to label noise.

Load-bearing premise

The first-order Taylor approximation around the trained parameters suffices to capture the relevant robustness behavior for SGD noise, label noise, quantization, and pruning.
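
One way to probe this premise is to compare the actual change in a model's outputs with its first-order prediction as the perturbation grows. The sketch below does this for a tiny two-layer tanh network; the network, the perturbation scales, and the helper names are all hypothetical choices made for illustration and are not taken from the paper.

```python
import numpy as np

def predict(W, v, X):
    """Two-layer network f(x) = v . tanh(W x), evaluated on rows of X."""
    return np.tanh(X @ W.T) @ v

def linearized_change(W, v, X, dW, dv):
    """First-order change in f under the parameter perturbation (dW, dv)."""
    H = np.tanh(X @ W.T)                               # (n, hidden)
    # df/dW contracted with dW, plus df/dv contracted with dv.
    return ((X @ dW.T) * (1.0 - H ** 2) * v).sum(axis=1) + H @ dv

def relative_remainder(scale, seed=0):
    """Norm of (true change - linear change) relative to the linear change."""
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(8, 3))
    v = rng.normal(size=8)
    X = rng.normal(size=(50, 3))
    dW = scale * rng.normal(size=W.shape)
    dv = scale * rng.normal(size=v.shape)
    true_change = predict(W + dW, v + dv, X) - predict(W, v, X)
    lin = linearized_change(W, v, X, dW, dv)
    return float(np.linalg.norm(true_change - lin) / np.linalg.norm(lin))

for scale in (1e-4, 1e-3, 1e-2, 1e-1, 1.0):
    print(scale, relative_remainder(scale))
```

For a smooth network the relative remainder shrinks with the perturbation scale, so a check of this kind indicates at what perturbation size the linear term stops dominating.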

What would settle it

A concrete counter-example would be an overparameterized network on a standard image dataset where training-set TPV and test-set TPV differ by a large relative amount after training to convergence.
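
A check of this kind could be scripted as below. The threshold is an assumption: it borrows the 50% error band around y = x mentioned in the Figure 1 caption and interprets it as a relative gap of at most 0.5, measured against the test-set value. The TPV numbers in the usage lines are made up for illustration.

```python
import numpy as np

def outside_band(tpv_train, tpv_test, band=0.5):
    """Relative train/test TPV gap per configuration, and which gaps exceed `band`."""
    tpv_train = np.asarray(tpv_train, dtype=float)
    tpv_test = np.asarray(tpv_test, dtype=float)
    rel_gap = np.abs(tpv_train - tpv_test) / tpv_test
    return rel_gap, rel_gap > band

# Made-up TPV values for three hypothetical configurations.
rel_gap, flagged = outside_band([1.0e-3, 2.0e-3, 9.0e-3],
                                [1.1e-3, 2.0e-3, 3.0e-3])
print(rel_gap, flagged)
```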

Figures

Figures reproduced from arXiv: 2512.11089 by Devansh Arpit.

Figure 1. TPV stability on synthetic data: Each point corresponds to one synthetic configuration (dataset type, input dimension, network width, depth) and one perturbation source (label noise; SGD noise). Axes show empirical TPV on the training and test sets; y = x is the ideal reference line, shown with a gray 50% error band; the colormap indicates the generalization gap L_test − L_train. We ran 324 configurations fro…
Figure 2. TPV stability on synthetic data: Analogous to …
Figure 3. TPV stability on CIFAR-10: Analogous to …
Figure 4. Empirical and theoretical TPV estimates under label noise on synthetic data for noise standard deviation σ = 0.01. As width increases, both TPV estimates reduce. Further, TPV correlates with test loss.
[Plot: axes Empirical TPV and Validation Accuracy; legend resnet50, resnet18, wide_resnet50_2, shufflenet_v2_x1_0, efficientnet_b0, mnasnet1_0, convnext_tiny.]
Figure 5. Empirical TPV estimates under label noise on ImageNet for noise standard deviation σ = 0.01. TPV stability holds and models that generalize better typically have lower TPV estimates.
Adjacent text, Section 5.2 (Empirical Evaluation of TPV Under Label Noise): We now empirically study the TPV term appearing in Theorem 3.2. Recall that for additive zero-mean label noise with variance σ_ε², the label-noise contribution to test pred…
Figure 6. Empirical and theoretical TPV estimates under label noise on synthetic data for noise σ = 0.1. As width increases, theoretical TPV and empirical test set TPV reduce, but training set TPV increases, breaking TPV stability when σ is large.
Figure 8. TPV stability on CIFAR-10: Analogous to …
Figure 9. TPV stability on CIFAR-100: Analogous to …
Figure 11. Empirical TPV estimates under target logit noise on CIFAR-10 for noise standard deviation σ = 0.01. Both TPV estimates reduce as width increases and correlate with the test set cross-entropy loss of the reference model.
[Plot: T_base vs Width; axes Width, T_base, Generalization Gap.]
Figure 12. Generalization gap and T_base vs. network width. As width increases, both quantities reduce.
Adjacent text: To estimate empirical TPV, for each pair (width, σ) with σ ∈ {0.01, 0.05, 0.1, 0.2}, we run 50 independent Monte Carlo trials. In each trial we add i.i.d. noise ϵ ∼ N(0, σ²) to the training labels, re-initialize the model at w⋆, and retrain using identical optimization settings for 500 epochs, and no proximity …
Figure 13. Pruning results of various criteria on CIFAR-10 with ResNet-56 (left) and CIFAR-100 with VGG-19 (right). JBR matches or outperforms existing methods.
[Plot: axes # MACs and Validate Accuracy; legend as extracted: JBR Jacobian Random L1 FPGM BN Scale WHC Taylor.]
Figure 14. Pruning results of various criteria on ImageNet dataset using ResNet-50 (left) and MobileNet-v2 (right) without fine-tuning. JBR matches or outperforms existing methods.
Adjacent text, Section H.3 (Pruning Experiments): We evaluate whether the TPV-motivated pruning criterion (JBR) improves accuracy–compression tradeoffs relative to standard groupwise criteria. Following the OBC pruning protocol, we perform global channel pruning …
Figure 15. Pruning results of various criteria on CIFAR-10 with ResNet-56 (left) and CIFAR-100 with VGG-19 (right). JBR matches or outperforms existing methods.
[Plot: axes Pruned Filters (%) and Validate Accuracy; legend as extracted: JBR Jacobian Random L1 FPGM BN Scale WHC Taylor.]
Figure 16.
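
The Monte Carlo procedure quoted alongside Figure 12 (add label noise, restart from the reference weights, retrain, repeat for 50 trials) can be sketched as follows. This is a simplified stand-in, not the paper's setup: it swaps the neural network for a random-feature model that is linear in its trained weights, uses full-batch gradient descent with a fixed number of steps, and all names (`retrain`, `empirical_tpv`, `features`) are hypothetical.

```python
import numpy as np

def retrain(Phi, y, w_start, steps=200):
    """Full-batch gradient descent on mean squared error, started at w_start."""
    n = Phi.shape[0]
    lr = n / np.linalg.norm(Phi, 2) ** 2   # 1 / (largest eigenvalue of Phi^T Phi / n)
    w = w_start.copy()
    for _ in range(steps):
        w -= lr * Phi.T @ (Phi @ w - y) / n
    return w

def empirical_tpv(Phi_train, y_train, Phi_eval, w_star, sigma, trials=50, seed=0):
    """Variance of predictions on Phi_eval across noisy-label retraining trials."""
    rng = np.random.default_rng(seed)
    preds = []
    for _ in range(trials):
        y_noisy = y_train + rng.normal(0.0, sigma, size=y_train.shape)
        w = retrain(Phi_train, y_noisy, w_star)     # every trial restarts at w_star
        preds.append(Phi_eval @ w)
    # Variance over trials for each evaluation input, then averaged over inputs.
    return float(np.stack(preds).var(axis=0).mean())

# Hypothetical synthetic task and random-feature model (assumptions, not the paper's).
rng = np.random.default_rng(0)
n, d, width = 40, 5, 200
W = rng.normal(size=(d, width))
features = lambda X: np.tanh(X @ W) / np.sqrt(width)
X_train, X_test = rng.normal(size=(n, d)), rng.normal(size=(n, d))
y_train = np.sin(X_train[:, 0])
Phi_train, Phi_test = features(X_train), features(X_test)
w_star = retrain(Phi_train, y_train, np.zeros(width), steps=2000)   # reference weights

tpv_train = empirical_tpv(Phi_train, y_train, Phi_train, w_star, sigma=0.01)
tpv_test = empirical_tpv(Phi_train, y_train, Phi_test, w_star, sigma=0.01)
print(tpv_train, tpv_test)
```

Passing the training features as `Phi_eval` gives the training-set estimate and passing held-out features gives the test-set estimate, so the same routine produces both axes of the stability plots.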
Original abstract

We introduce test prediction variance (TPV)--the first-order sensitivity of a trained model's outputs to parameter perturbations--as a unifying framework for analyzing post-training robustness. TPV is a fully label-free object whose trace form separates the geometry of the trained model from the specific perturbation mechanism, placing SGD noise, label noise, quantization, and pruning under a single lens. The resulting expressions recover the wide-minima hypothesis for SGD and quantization noise, and yield a distinct Jacobian-spectral characterization for label noise connecting label-noise TPV with benign overfitting in nonlinear networks. Theoretically, we prove that training-set TPV converges to its test-set counterpart in the overparameterized limit, irrespective of generalization performance, providing the first result that prediction variance under local parameter perturbations can be inferred from training inputs alone. Empirically, this stability holds far more broadly, including at very low widths. Further, TPV correlates well with test loss, enabling practical applications: JBR, a label-free pruning criterion derived from TPV geometry matching state-of-the-art baselines; and training-set based model selection signal for in-distribution and transfer learning scenarios. Code available at github.com/devansharpit/TPV.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces Test Prediction Variance (TPV) as the first-order sensitivity of a trained model's outputs to parameter perturbations around the trained weights. TPV is presented as a label-free unifying framework whose trace form separates model geometry from the perturbation mechanism, recovering the wide-minima hypothesis for SGD and quantization while providing a Jacobian-spectral characterization for label noise. The central theoretical result is a proof that training-set TPV converges to its test-set counterpart in the overparameterized limit irrespective of generalization performance. Empirically, this convergence is shown to hold more broadly (including low widths), TPV correlates with test loss, and it yields a pruning criterion (JBR) competitive with state-of-the-art baselines plus a training-set-based signal for model selection.

Significance. If the convergence theorem and the first-order approximation are rigorously justified across mechanisms, the work offers a novel label-free route to analyzing post-training robustness and benign overfitting in nonlinear networks. The explicit separation of geometry from perturbation type and the availability of code for reproducibility are strengths. The result that prediction variance under local perturbations can be inferred from training inputs alone would be a useful addition to the literature on sensitivity analysis and model compression.

major comments (2)
  1. [§3] §3 (TPV definition and Taylor expansion): The framework defines TPV via the first-order Jacobian trace and claims unification for pruning and quantization. However, pruning (hard zeroing) and quantization (finite discrete shifts) induce finite perturbations where the Taylor remainder is not necessarily negligible. No explicit bounds on the remainder term or regime of validity are provided showing that the linear term dominates uniformly for both training and test inputs; if the higher-order contributions differ across distributions, the claimed train-to-test convergence would not transfer to these mechanisms.
  2. [Theorem 1] Theorem 1 (overparameterized convergence): The proof establishes convergence of training-set TPV to test-set TPV under the first-order approximation. It is unclear whether the argument extends when the perturbation operator itself (e.g., the discrete mask for pruning) introduces distribution-dependent higher-order effects that are not controlled by the overparameterization assumption alone.
minor comments (2)
  1. [Abstract] The abstract states that TPV 'recovers the wide-minima hypothesis'; a brief pointer to the precise recovered statement (e.g., which prior result on flatness) would help readers.
  2. [Empirical evaluation] In the empirical sections, clarify whether the reported correlations between TPV and test loss are computed on the same models used for the convergence plots or on held-out architectures.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their careful reading and insightful comments on the scope of the first-order approximation. We address each major point below and will incorporate clarifications into the revised manuscript to better delineate the regime in which TPV serves as a unifying lens.

Point-by-point responses
  1. Referee: [§3] §3 (TPV definition and Taylor expansion): The framework defines TPV via the first-order Jacobian trace and claims unification for pruning and quantization. However, pruning (hard zeroing) and quantization (finite discrete shifts) induce finite perturbations where the Taylor remainder is not necessarily negligible. No explicit bounds on the remainder term or regime of validity are provided showing that the linear term dominates uniformly for both training and test inputs; if the higher-order contributions differ across distributions, the claimed train-to-test convergence would not transfer to these mechanisms.

    Authors: TPV is defined in Section 3 precisely as the first-order term Tr(J Σ J^T) obtained from the linearization of the network output around the trained weights. The unification claim is therefore with respect to this local sensitivity measure: different mechanisms enter only through the choice of the perturbation second-moment matrix Σ (or its empirical counterpart), while the geometry is captured by the Jacobian. For finite perturbations such as hard pruning or quantization steps, the full change in prediction indeed includes higher-order terms; however, the manuscript positions TPV as a computationally tractable, label-free proxy for local robustness rather than an exact global characterization. We agree that the absence of explicit remainder bounds leaves open the question of when the linear term dominates uniformly. In the revision we will add a short paragraph in Section 3 stating that the framework applies directly when perturbations are small relative to the local curvature (e.g., low-bit quantization or pruning of low-magnitude weights) and will cite the empirical success of the JBR pruning rule in Section 5 as supporting evidence that the first-order signal remains informative even for finite masks. We will also clarify that Theorem 1 concerns convergence of the first-order TPV itself and does not automatically extend to the full nonlinear prediction change under data-dependent masks. revision: partial

  2. Referee: [Theorem 1] Theorem 1 (overparameterized convergence): The proof establishes convergence of training-set TPV to test-set TPV under the first-order approximation. It is unclear whether the argument extends when the perturbation operator itself (e.g., the discrete mask for pruning) introduces distribution-dependent higher-order effects that are not controlled by the overparameterization assumption alone.

    Authors: The proof of Theorem 1 proceeds by showing that, in the overparameterized regime, the empirical Jacobian on the training set concentrates around its population counterpart, causing the trace expression for TPV to coincide for train and test inputs irrespective of the particular perturbation covariance Σ. Because the argument is formulated entirely at the level of the first-order term, it does not claim control over higher-order remainders that would arise from a nonlinear perturbation operator (such as a data-dependent pruning mask). We will revise the statement of Theorem 1 and the paragraph immediately following it to emphasize that the result applies to any fixed perturbation distribution Σ and that, for pruning, the mask is chosen on the basis of the TPV geometry itself; any additional distribution shift induced by the mask selection step lies outside the current theorem. The manuscript already reports that the observed train-to-test stability of TPV holds empirically well beyond the overparameterized regime (including low-width networks), which provides practical reassurance even if a fully rigorous extension to nonlinear operators remains open. revision: partial
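
The first response states that perturbation mechanisms enter only through the second-moment matrix Σ while the Jacobian carries the geometry. A minimal sketch of that bookkeeping follows. The specific Σ choices are assumptions made here for illustration: isotropic noise as a generic stand-in, the textbook step²/12 variance model for uniform rounding error, and a rank-one second moment for a fixed pruning mask. None of them is taken from the paper.

```python
import numpy as np

def tpv(J, second_moment):
    """First-order TPV at one input: Tr(J Sigma J^T)."""
    return float(np.trace(J @ second_moment @ J.T))

rng = np.random.default_rng(0)
J = rng.normal(size=(3, 6))     # hypothetical Jacobian: 3 outputs, 6 parameters
w = rng.normal(size=6)          # hypothetical trained weights

# Isotropic parameter noise with standard deviation s.
s = 0.01
tpv_iso = tpv(J, s ** 2 * np.eye(6))

# Uniform quantization with grid spacing `step`, modelled as independent
# rounding errors with variance step^2 / 12 (a standard modelling assumption).
step = 0.05
tpv_quant = tpv(J, step ** 2 / 12.0 * np.eye(6))

# Pruning with a fixed mask: the perturbation delta = -mask * w is deterministic,
# so its second moment is the rank-one matrix delta delta^T.
mask = np.array([1.0, 0.0, 0.0, 1.0, 0.0, 0.0])
delta = -mask * w
tpv_prune = tpv(J, np.outer(delta, delta))

print(tpv_iso, tpv_quant, tpv_prune)
```

In the pruning case the trace collapses to the squared norm of J δ, the first-order output change caused by zeroing the masked weights.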

Circularity Check

0 steps flagged

No significant circularity: TPV convergence theorem is a self-contained mathematical result from sensitivity analysis

full rationale

The paper defines TPV explicitly as the first-order sensitivity (Jacobian trace) of model outputs to parameter perturbations around trained weights. It then derives unifying expressions for SGD noise, label noise, quantization, and pruning by separating this geometry from the perturbation mechanism. The central theoretical result—a proof that training-set TPV converges to test-set TPV in the overparameterized limit irrespective of generalization—is presented as an independent mathematical statement, not obtained by fitting parameters to target data or by renaming inputs. No self-citations are invoked as load-bearing premises, no ansatz is smuggled via prior work, and no prediction reduces by construction to a fitted quantity. The framework remains self-contained against external benchmarks; applications such as the JBR pruning criterion are downstream uses rather than definitional. This is the expected honest non-finding for a sensitivity-analysis paper whose derivations do not tautologically reproduce their inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Review performed from abstract only; the central claims rest on the validity of the first-order sensitivity definition and the overparameterized regime assumption for the convergence result.

axioms (2)
  • domain assumption: First-order Taylor expansion around trained parameters suffices to characterize robustness to the listed perturbation mechanisms.
    Implicit in the definition of TPV as first-order sensitivity.
  • domain assumption: Overparameterized limit in which training-set TPV converges to test-set TPV.
    Stated as the setting for the main theoretical result.

pith-pipeline@v0.9.0 · 5737 in / 1476 out tokens · 44982 ms · 2026-05-21T16:53:46.929823+00:00 · methodology



Reference graph

Works this paper leans on

13 extracted references · 13 canonical work pages · 1 internal anchor

  1. [1]

    Proceedings of the National Academy of Sciences 105(38), 14615–14619 (2008) https://doi.org/10.1073/pnas

    URL http://proceedings.mlr.press/v97/allen-zhu19a/allen-zhu19a.pdf. Arora, S., Du, S. S., Hu, W., Li, Z., Salakhutdinov, R. R., and Wang, R. On exact computation with an infinitely wide neural net. Advances in neural information processing systems, 32, 2019. Bar, A., Mulayoff, R., Michaeli, T., and Talmon, R. The expected loss of preconditioned langev...

  2. [2]

    Pruning Convolutional Neural Networks for Resource Efficient Inference

    PMLR, 2021. Liu, Z., Li, J., Shen, Z., Huang, G., Yan, S., and Zhang, C. Learning efficient convolutional networks through network slimming. In Proceedings of the IEEE international conference on computer vision, pp. 2736–2744, 2017. Mandt, S., Hoffman, M. D., and Blei, D. M. A variational analysis of stochastic gradient algorithms. In Proceedings of th...

  3. [3]

    Start from the reference parameters θ⋆

  4. [4]

    Perturb the training labels: ỹ = y + ε

  5. [5]

    Retrain for a small number of steps using the objective L_noisy(θ) + (γ/2) ∥w − w⋆∥₂², which ensures optimization remains in the local neighborhood where the linearized approximation is valid

  6. [6]

    Evaluate prediction changes on test data, implicitly incorporating the test Jacobian J_te

  7. [7]

    While it does not yield the exact minimum-norm linearized solution, it provides a robust local approximation to the TPV dynamics in regimes where the exact computation is infeasible

    Estimate TPV as the empirical variance of the test predictions across multiple independent perturbations. While it does not yield the exact minimum-norm linearized solution, it provides a robust local approximation to the TPV dynamics in regimes where the exact computation is infeasible. Important Practical Considerations: There are a couple of important pra...

  8. [8]

    To achieve this faithfully, training must be done in eval mode, i.e., modules like batch norm and dropout should not be active

    Models need to be trained in eval mode: we perturb the logits of the clean model’s prediction with Gaussian noise and train a copy of the reference model to fit these new targets, which act as an infinitesimal change in targets. To achieve this faithfully, training must be done in eval mode, i.e., modules like batch norm and dropout should not be active. The...

  9. [9]

    different mini-batch in each epoch)

    Mini-batch shuffling: All sources of randomness other than label noise should be removed as much as possible to isolate the effect of label noise when measuring TPV (e.g. different mini-batch in each epoch). In practice, we do use mini-batch SGD for noisy label fine-tuning for efficiency and to make the training loss go down in some cases. However, we ens...

  10. [10]

    This can be easily overlooked, and if the loss does not go down or diverges, it can easily lead to incorrect TPV estimates

    MSE training loss must go down during training. This can be easily overlooked, and if the loss does not go down or diverges, it can easily lead to incorrect TPV estimates. We found this to be especially true in the case of ImageNet experiments, where it was extremely difficult to fit noisy target logits. E. TPV for SGD Stationary Noise Setup. Consider a sc...

  11. [11]

    Details for CIFAR-100 (Fig

    Vary Width Experiment Details: We describe the details for the CIFAR-10 experiment in Fig. 3 below. Details for CIFAR-100 (Fig. 9) are similar except the output has 100-dimensional logits and we use analogous CIFAR-100 pre-trained architectures. Dataset and preprocessing. We use the standard CIFAR-10 per-channel normalization. From these, we randomly subsam...

  12. [12]

    8), we use pretrained ResNet–20/32/44/56 models on CIFAR-10

    Vary Number of Samples Experiment Details: For the experiment with varying number of training samples (Fig. 8), we use pretrained ResNet–20/32/44/56 models on CIFAR-10. Notice these architectures have different depth, which is not a consideration in the TPV theory, and is merely used as a source of variation in our experiments. The experimental details ar...

  13. [13]

    and can be seen as a label-free version of JC. Both JBR and JC assign a score to each parameter group g of the form score(w_g) = E_x[ w_g^T J_g(x)^T m(x) m(x)^T J_g(x) w_g ] = E_x[ (m(x)^T v_g(x))^2 ], where v_g(x) = J_g(x) w_g is the logit-space direction induced by group g, and the only difference between the two methods lies in the choice of the logit-space vector m(x): m_JC(x)...
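
The group score in the last extracted snippet can be evaluated directly once the per-group Jacobians and a logit-space vector m(x) are available. The sketch below implements only the generic form E_x[(m(x)^T v_g(x))^2]. Because the snippet is truncated before it defines m(x) for either method, m is left as an input and filled with random placeholder values; the function name `group_score` and the array shapes are assumptions.

```python
import numpy as np

def group_score(J_g, w_g, m):
    """Score of one parameter group: mean over inputs of (m(x)^T J_g(x) w_g)^2.

    J_g: (n_inputs, n_logits, group_size) Jacobian of the logits w.r.t. the group
    w_g: (group_size,) current weights of the group
    m:   (n_inputs, n_logits) logit-space vector for each input
    """
    v_g = J_g @ w_g                        # logit-space direction, shape (n_inputs, n_logits)
    return float(((m * v_g).sum(axis=1) ** 2).mean())

# Placeholder data; m is random here because the source does not show its definition.
rng = np.random.default_rng(0)
J_g = rng.normal(size=(10, 4, 6))
w_g = rng.normal(size=6)
m = rng.normal(size=(10, 4))

score = group_score(J_g, w_g, m)
print(score)
```

Computing v_g(x) first avoids forming the outer product m(x) m(x)^T explicitly, which is the cheaper of the two equivalent expressions in the formula.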