The Neural Tangent Kernel for Classification

Alvaro Cartea; Jonathan Plenk; Kamil Ciosek; Mark van der Wilk; Sergio Calvo-Ordonez; Yarin Gal

arxiv: 2605.17606 · v2 · pith:KJ54URAYnew · submitted 2026-05-17 · 💻 cs.LG


Jonathan Plenk, Sergio Calvo-Ordonez, Alvaro Cartea, Yarin Gal, Mark van der Wilk, Kamil Ciosek

Pith reviewed 2026-06-30 18:50 UTC · model grok-4.3

classification 💻 cs.LG

keywords: neural tangent kernel, classification, cross-entropy loss, lazy training, wide neural networks, regularization, linearized model, model uncertainty


The pith

Wide neural networks stay in the lazy regime under cross-entropy loss when training is regularized or when the targets are non-degenerate.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper extends Neural Tangent Kernel theory to classification by identifying conditions that keep the NTK constant during training. Parameter-space regularization maintains this constancy for cross-entropy loss. Without regularization the constant-NTK regime returns when every class has strictly positive probability. Under either condition the full nonlinear training is well approximated by the linearized model, which supplies an explicit solution expressed through the NTK. The distribution of predictors obtained from random initialization is also related to Bayesian uncertainty estimates.

Core claim

In the infinite-width limit, wide neural networks trained on classification losses remain in the lazy training regime when either parameter-space regularization is applied or when the target distributions are non-degenerate, meaning every class has positive probability. This constancy of the NTK allows training to be approximated by the linearized model, which yields an explicit characterization of the trained predictor in terms of the NTK. The distribution of such predictors over random initializations can be related to Bayesian posterior predictive distributions.
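For orientation, the two objects named in the claim can be written out in standard NTK notation. This is a sketch under assumed notation ($f_\theta$ for the logit map with $K$ outputs, $J_\theta$ for its parameter Jacobian, $\theta_0$ for the initialization); the paper's own symbols may differ.

```latex
% Empirical NTK of the logit map f_\theta : \mathbb{R}^d \to \mathbb{R}^K
\Theta_\theta(x, x') = J_\theta(x)\, J_\theta(x')^\top \in \mathbb{R}^{K \times K},
\qquad J_\theta(x) = \frac{\partial f_\theta(x)}{\partial \theta}

% Linearized model: first-order Taylor expansion around the initialization
f^{\mathrm{lin}}_\theta(x) = f_{\theta_0}(x) + J_{\theta_0}(x)\,(\theta - \theta_0)
```

The kernel of the linearized model equals $\Theta_{\theta_0}$ for every $\theta$ by construction, so "lazy training" amounts to the statement that the full network's kernel $\Theta_{\theta_t}$ stays close to $\Theta_{\theta_0}$ along the training path.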

What carries the argument

The Neural Tangent Kernel, shown to remain constant under regularization or non-degenerate targets for cross-entropy loss, which enables linearization of the training dynamics.

If this is right

Training dynamics for classification become explicitly characterizable using the NTK.
The trained predictor admits a closed-form expression in terms of the NTK under the stated conditions.
The distribution of predictors induced by random initialization supplies a concrete notion of model uncertainty that connects to Bayesian methods.
The lazy-training regime applies to cross-entropy loss once regularization or non-degenerate targets are present.
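The third consequence above, a distribution of predictors induced by random initialization, can be probed empirically with a small ensemble. The following is a minimal sketch in plain Python, assuming a toy one-hidden-layer tanh network with a single logit (binary classification) in NTK parameterization; the names `make_net`, `train`, and `ensemble_stats` are illustrative and do not come from the paper.

```python
import math
import random
import statistics

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def make_net(width, seed):
    # Hypothetical toy model: f(x) = sum_j a_j * tanh(w_j * x + b_j) / sqrt(width).
    rng = random.Random(seed)
    return [[rng.gauss(0, 1) for _ in range(width)] for _ in range(3)]  # w, b, a

def logit(net, x):
    w, b, a = net
    return sum(aj * math.tanh(wj * x + bj) for wj, bj, aj in zip(w, b, a)) / math.sqrt(len(w))

def train(net, xs, ys, lr=0.5, steps=300):
    # Full-batch gradient descent on binary cross-entropy with soft targets ys.
    w, b, a = net
    n, scale = len(w), 1.0 / math.sqrt(len(w))
    for _ in range(steps):
        gw, gb, ga = [0.0] * n, [0.0] * n, [0.0] * n
        for x, y in zip(xs, ys):
            err = (sigmoid(logit(net, x)) - y) / len(xs)  # d loss / d logit
            for j in range(n):
                h = math.tanh(w[j] * x + b[j])
                dh = 1.0 - h * h
                gw[j] += err * scale * a[j] * dh * x
                gb[j] += err * scale * a[j] * dh
                ga[j] += err * scale * h
        # Parameters are updated only after the whole batch gradient is accumulated.
        for j in range(n):
            w[j] -= lr * gw[j]
            b[j] -= lr * gb[j]
            a[j] -= lr * ga[j]
    return net

def ensemble_stats(xs, ys, x_test, members=8, width=32):
    # Predicted probability at x_test for each independently initialized member.
    probs = [sigmoid(logit(train(make_net(width, seed), xs, ys), x_test))
             for seed in range(members)]
    return statistics.mean(probs), statistics.stdev(probs)

xs, ys = [-1.0, -0.5, 0.5, 1.0], [0.05, 0.05, 0.95, 0.95]  # label-smoothed targets
mean_p, std_p = ensemble_stats(xs, ys, x_test=0.75)
print(f"ensemble mean={mean_p:.3f}, std={std_p:.3f}")
```

The spread `std_p` across seeds is the finite-width, finite-ensemble counterpart of the model uncertainty discussed above; the widths and step counts here are far too small to say anything about the infinite-width limit.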

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same conditions may allow generalization bounds for classification to be derived directly from the NTK, mirroring regression results.
Finite-width networks could be monitored during training to quantify how far they deviate from the constant-NTK regime as a function of regularization strength.
Similar constancy arguments might extend to other losses that involve nonlinear output maps once appropriate regularization or target conditions are identified.

Load-bearing premise

The network must be in the infinite-width limit so that the NTK remains approximately constant throughout training.

What would settle it

Train a wide but finite network on cross-entropy loss without regularization using targets where at least one class has zero probability and check whether the empirical NTK changes appreciably during training.
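A scaled-down version of this check can be sketched in plain Python. The sketch assumes a one-hidden-layer tanh network with a single logit and NTK parameterization, with the Jacobian written out by hand; all function names are hypothetical, and a serious test would use a much wider network and an autodiff library.

```python
import math
import random

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def init_params(width, seed=0):
    rng = random.Random(seed)
    w = [rng.gauss(0, 1) for _ in range(width)]
    a = [rng.gauss(0, 1) for _ in range(width)]
    return w, a

def logit(params, x):
    # Toy model: f(x) = sum_j a_j * tanh(w_j * x) / sqrt(width).
    w, a = params
    return sum(aj * math.tanh(wj * x) for wj, aj in zip(w, a)) / math.sqrt(len(w))

def jacobian(params, x):
    # d logit / d parameters, ordered as all w_j followed by all a_j.
    w, a = params
    s = 1.0 / math.sqrt(len(w))
    dw = [s * aj * (1.0 - math.tanh(wj * x) ** 2) * x for wj, aj in zip(w, a)]
    da = [s * math.tanh(wj * x) for wj in w]
    return dw + da

def empirical_ntk(params, x1, x2):
    # Theta(x1, x2) = <J(x1), J(x2)> for a scalar-output network.
    return sum(u * v for u, v in zip(jacobian(params, x1), jacobian(params, x2)))

def gd_step(params, xs, ys, lr):
    # One full-batch gradient step on binary cross-entropy with targets ys in [0, 1].
    w, a = params
    n = len(w)
    grad = [0.0] * (2 * n)
    for x, y in zip(xs, ys):
        err = (sigmoid(logit(params, x)) - y) / len(xs)
        for i, j in enumerate(jacobian(params, x)):
            grad[i] += err * j
    new_w = [wj - lr * g for wj, g in zip(w, grad[:n])]
    new_a = [aj - lr * g for aj, g in zip(a, grad[n:])]
    return new_w, new_a

def ntk_drift(width, ys, steps=200, lr=0.5, x_probe=0.5):
    # Relative change of Theta(x_probe, x_probe) between initialization and end of training.
    xs = [-1.0, 1.0]
    params = init_params(width)
    k0 = empirical_ntk(params, x_probe, x_probe)
    for _ in range(steps):
        params = gd_step(params, xs, ys, lr)
    return abs(empirical_ntk(params, x_probe, x_probe) - k0) / k0

print("one-hot targets :", ntk_drift(width=256, ys=[0.0, 1.0]))
print("smoothed targets:", ntk_drift(width=256, ys=[0.1, 0.9]))
```

Comparing the two printed drifts across increasing `width` and longer training is the experiment described above in miniature; the sketch makes no claim about which drift comes out larger at this scale.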

Figures

Figures reproduced from arXiv: 2605.17606 by Alvaro Cartea, Jonathan Plenk, Kamil Ciosek, Mark van der Wilk, Sergio Calvo-Ordonez, Yarin Gal.

**Figure 1.** 1d-classification with 3 classes. Left: An ensemble over wide networks, starting from different parameter initializations. Right: The infinite-width limit of the ensemble, using the function-space ODE.

Adjacent source text (Section 4, Connection of the infinite-width ensemble to Bayesian methods): The previous section characterized the trained linearized predictor through the inverse map $\Phi^{-1}$. We now use this characterization to study th…

**Figure 2.** Blue: Pre-softmax NTK (constant). Red: Post-softmax NTK (not constant).

Adjacent source text (Section 5.2, MNIST Classification): Following Yu et al. [2025] we train a four-layer fully connected neural network on MNIST [LeCun et al., 2002] using 2 classes (odd or even). By using softmax with a reference class, the logit dimension is 1 and thus the kernel is scalar-valued. We plot the evolution of the empirical NTK $t \mapsto \hat{\Theta}_{\theta_t}(x, x)$ during…
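The contrast between the two curves follows from the chain rule. The block below is a sketch under the assumption that "post-softmax NTK" means the tangent kernel of the probability outputs $p_\theta = \sigma(f_\theta)$; the symbols $S$ and $s$ are introduced here for illustration.

```latex
% Softmax Jacobian at logits z
S(z) = \operatorname{diag}\big(\sigma(z)\big) - \sigma(z)\,\sigma(z)^\top

% Kernel of the probabilities, obtained from the logit kernel by the chain rule
\Theta^{\mathrm{post}}_\theta(x, x')
  = S\big(f_\theta(x)\big)\, \Theta_\theta(x, x')\, S\big(f_\theta(x')\big)^\top

% Scalar case (two classes with a reference class, s = logistic function)
\Theta^{\mathrm{post}}_\theta(x, x')
  = s'\big(f_\theta(x)\big)\, \Theta_\theta(x, x')\, s'\big(f_\theta(x')\big),
\qquad s'(z) = s(z)\,\big(1 - s(z)\big)
```

Even if $\Theta_\theta$ stays fixed, the factors $S(f_\theta(\cdot))$ change as the logits move during training, which is consistent with a constant pre-softmax kernel and a non-constant post-softmax kernel.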

**Figure 3.** The NTK for MNIST classification does not diverge when using label smoothing or …
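Label smoothing is one common way to make one-hot targets non-degenerate. A minimal sketch of the usual uniform variant follows; the function name and the value of `eps` are assumptions for illustration, and the variant used in the paper is not stated here.

```python
def smooth_labels(one_hot, eps=0.1):
    # Mix the one-hot target with the uniform distribution over K classes,
    # so that every class receives strictly positive probability when eps > 0.
    k = len(one_hot)
    return [(1.0 - eps) * y + eps / k for y in one_hot]

smoothed = smooth_labels([0.0, 1.0, 0.0])
print(smoothed)
```

The smoothed vector still sums to one, and its smallest entry is `eps / k`, which gives the strict positivity that the non-degeneracy condition asks for.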

**Figure 4.** The function-space Brier score with a regularizer can have multiple stationary points.


Original abstract

In wide neural networks, the Neural Tangent Kernel (NTK) remains approximately constant during training, providing a powerful theoretical tool for studying training dynamics, generalization, and connections to kernel methods. However, this theory is largely restricted to regression losses. It was previously thought that training on a classification loss, or more generally losses involving nonlinear output transformations, breaks this property, leading to divergent logits and a breakdown of the linearization. In this paper, we extend NTK theory to classification by identifying conditions under which wide neural networks remain in the lazy training regime. We show that parameter-space regularization ensures a constant NTK during training for cross-entropy loss, while in the absence of regularization the regime is recovered when targets are non-degenerate, i.e. when all classes have strictly positive probability. Under these conditions, training is well-approximated by the linearized model, yielding an explicit characterization of the solution in terms of the NTK. We further analyze the distribution of trained predictors induced by random initialization and relate this notion of model uncertainty to Bayesian methods.
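The role of non-degenerate targets can be seen from the function-space form of gradient flow on cross-entropy. The block below is a sketch in assumed notation (training inputs $x_1, \dots, x_N$, target distributions $y_i$, softmax $\sigma$, learning rate $\eta_0$, loss summed over examples without a $1/N$ factor); it is not copied from the paper.

```latex
% Gradient flow on the summed cross-entropy loss, written in function space
\frac{d}{dt} f_{\theta_t}(x)
  = -\eta_0 \sum_{i=1}^{N} \Theta_{\theta_t}(x, x_i)\,
    \Big( \sigma\big(f_{\theta_t}(x_i)\big) - y_i \Big)

% With a positive-definite kernel matrix on the training set,
% stationarity requires the predicted probabilities to match the targets
\sigma\big(f(x_i)\big) = y_i
\;\Longleftrightarrow\;
f(x_i) = \log y_i + c_i \mathbf{1}, \quad c_i \in \mathbb{R}
```

The stationary logits are finite exactly when every entry of $y_i$ is strictly positive. With one-hot targets, $\log y_i$ contains $-\infty$, so no finite stationary point exists and the logits must grow without bound, which is the divergence the abstract refers to.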

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper gives concrete conditions (regularization or non-degenerate targets) that keep the NTK constant for cross-entropy, recovering the linearized regime for classification.


The central point is that parameter-space regularization or non-degenerate class probabilities let wide networks stay in the lazy regime on cross-entropy, so the NTK stays roughly constant and training is well approximated by the linear model with an explicit NTK solution. That is the main new piece relative to the regression-focused NTK literature.

It does the extension cleanly. The conditions are stated directly, the link back to the linearized predictor is explicit, and the random-initialization analysis that connects to Bayesian uncertainty is a useful extra. These are the parts that actually move the theory forward for classification.

The soft spots are the usual ones for this line of work. Everything still requires the infinite-width limit so the kernel stays constant; that is not new but it limits how far the results travel to finite nets. The non-degenerate target condition is plausible on paper but could be fragile on real imbalanced data, and without the full derivations it is hard to judge how tightly the regularization has to be tuned. No load-bearing circularity or hidden fitting shows up in the abstract or stress-test.

This is for people already working on NTK theory, kernel approximations, or lazy training. A reader who cares about closing the regression-to-classification gap will find the conditions and the explicit solution useful. It is worth sending to serious referees because the claim is focused, the gap it fills is real, and the argument appears internally consistent even if the practical scope remains narrow.

Referee Report

2 major / 2 minor

Summary. The paper extends Neural Tangent Kernel (NTK) theory, previously limited to regression, to classification with cross-entropy loss. It identifies two conditions under which wide networks remain in the lazy regime with approximately constant NTK: (i) parameter-space regularization, and (ii) non-degenerate targets (all classes having strictly positive probability) without regularization. Under these conditions the training dynamics are well-approximated by the linearized model, yielding an explicit NTK-based characterization of the solution; the work also analyzes the distribution of predictors induced by random initialization and its relation to Bayesian methods.

Significance. If the derivations hold, the result meaningfully broadens the NTK framework to the classification setting that dominates practical applications. The explicit characterization and the Bayesian connection supply new analytic tools for dynamics, generalization, and uncertainty in classification, while the stated conditions clarify when the lazy-regime approximation remains valid.

major comments (2)

[Main derivation of constant NTK under regularization] The central claim that parameter-space regularization keeps the NTK exactly constant for cross-entropy loss rests on a derivation that must be verified in the main text; without seeing the precise form of the regularizer and the resulting ODE for the kernel, it is impossible to confirm that the constancy is not an artifact of the linearization assumption itself.
[Section on non-degenerate targets] The non-degenerate-target condition (all classes have strictly positive probability) is invoked to recover the lazy regime without regularization. It is unclear whether this condition is necessary or merely sufficient; a counter-example or a relaxation to weaker positivity requirements would strengthen the result.

minor comments (2)

Notation for the output transformation and the target distribution should be introduced once and used uniformly; several symbols appear to be redefined between the abstract and the technical sections.
The discussion relating the induced predictor distribution to Bayesian methods would benefit from an explicit comparison (e.g., to the NTK-GP posterior) rather than a high-level statement.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive evaluation and the recommendation of minor revision. We address each major comment below and will revise the manuscript accordingly.


Referee: [Main derivation of constant NTK under regularization] The central claim that parameter-space regularization keeps the NTK exactly constant for cross-entropy loss rests on a derivation that must be verified in the main text; without seeing the precise form of the regularizer and the resulting ODE for the kernel, it is impossible to confirm that the constancy is not an artifact of the linearization assumption itself.

Authors: We agree that the derivation requires clearer presentation in the main text. The regularizer is the standard squared L2 penalty on the parameters. Under the infinite-width NTK linearization, the gradient flow on the regularized cross-entropy loss yields an ODE in which the kernel remains exactly constant because the parameter updates remain infinitesimal and the feature map is frozen at initialization. We will move the explicit regularizer form and the resulting kernel ODE from the appendix into Section 3 of the main text, together with a short paragraph explaining why the constancy is a direct consequence of the regularized dynamics rather than an artifact of linearization. Revision: yes.
Referee: [Section on non-degenerate targets] The non-degenerate-target condition (all classes have strictly positive probability) is invoked to recover the lazy regime without regularization. It is unclear whether this condition is necessary or merely sufficient; a counter-example or a relaxation to weaker positivity requirements would strengthen the result.

Authors: The condition is stated as sufficient: when every class probability is bounded away from zero, the logits remain bounded and the NTK stays approximately constant. We do not claim necessity. We will add a clarifying paragraph in Section 4 noting that the condition is sufficient for our proof technique and briefly discussing why weaker positivity (e.g., targets that can approach zero) may allow divergence in some cases. A rigorous counter-example showing that the condition is not necessary would require constructing a specific degenerate target distribution for which the lazy regime nevertheless holds; while we can add a short remark on this open direction, a full counter-example lies outside the scope of the present work. Revision: partial.
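The boundedness claim in this response can be illustrated with a toy function-space simulation. The sketch below assumes a single training point, three classes, and a constant kernel equal to a scalar times the identity, so it shows only the behavior of the logits under a frozen kernel and says nothing about whether a real network's kernel stays constant; `logit_flow` is a hypothetical helper.

```python
import math

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def logit_flow(target, kernel=1.0, lr=0.2, steps=5000):
    # Euler discretization of dz/dt = -kernel * (softmax(z) - target)
    # for one training point with a scalar, constant kernel value.
    z = [0.0] * len(target)
    norms = []
    for _ in range(steps):
        p = softmax(z)
        z = [zi - lr * kernel * (pi - ti) for zi, pi, ti in zip(z, p, target)]
        norms.append(math.sqrt(sum(v * v for v in z)))
    return z, norms

z_hot, n_hot = logit_flow([1.0, 0.0, 0.0])          # degenerate (one-hot) target
z_smooth, n_smooth = logit_flow([0.9, 0.05, 0.05])  # non-degenerate target
mid = len(n_hot) // 2
print("one-hot : |z| at mid/end =", n_hot[mid], n_hot[-1])
print("smoothed: |z| at mid/end =", n_smooth[mid], n_smooth[-1])
```

With the one-hot target the logit norm is still increasing at the end of the run, while with the smoothed target it settles at the point where `softmax(z)` equals the target.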

Circularity Check

0 steps flagged

No significant circularity identified


The paper extends the infinite-width NTK linearization (a standard external assumption) to cross-entropy loss by deriving conditions under which the NTK stays constant: parameter-space regularization or non-degenerate targets. This produces an explicit solution characterization in terms of the NTK. No step reduces by construction to a fitted parameter renamed as prediction, a self-definitional loop, or a load-bearing self-citation chain; the derivation is self-contained against the usual NTK regime and does not import uniqueness theorems or ansatzes from the authors' prior work. The central claim therefore adds independent content rather than renaming or tautologically recovering its inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the infinite-width NTK regime assumption standard in the field; no free parameters, invented entities, or ad-hoc axioms are mentioned in the abstract.

axioms (1)

Domain assumption: the infinite-width limit keeps the NTK constant during training.
Invoked to justify linearization for both regression and the new classification setting.

pith-pipeline@v0.9.1-grok · 5727 in / 1133 out tokens · 19589 ms · 2026-06-30T18:50:22.638264+00:00 · methodology


