Continual Rare-Class Recognition with Emerging Novel Subclasses

Hung Nguyen; Leman Akoglu; Xuejian Wang

arxiv: 1906.12218 · v1 · pith:S3HYBJF4new · submitted 2019-06-28 · 💻 cs.LG · stat.ML

Continual Rare-Class Recognition with Emerging Novel Subclasses

Hung Nguyen , Xuejian Wang , Leman Akoglu This is my paper

Pith reviewed 2026-05-25 13:34 UTC · model grok-4.3

classification 💻 cs.LG stat.ML

keywords continual learningrare class recognitionemerging subclassesimbalanced data streamsnovel class detectionminority class learningstreaming classification

0 comments

The pith

RaRecognize maintains a general rare-majority boundary separate from specialized subclass models to detect both known and emerging rare classes in streams.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a method for continual recognition of a rare class in streaming data, where new subclasses of that rare class can appear after training. It builds one general learner to separate rare instances from the abundant majority class and additional specialized learners for the rare subclasses already seen in the data. Because the general learner is constructed to differ from the specialized ones, it can label future instances that belong to previously unseen rare subclasses as new rather than forcing them into known categories or the majority. A reader would care because the approach avoids retraining on every new majority instance and keeps the overall model size from growing rapidly while still handling the appearance of novel rare patterns.

Core claim

RaRecognize estimates a general decision boundary between the rare and majority classes that is kept dissimilar by construction from the specialized learners trained on individual rare subclasses present in the initial data. This separation lets the system recognize recurrent rare subclasses, flag instances from unseen rare subclasses as newly emerging, and discard all instances labeled as majority, thereby limiting both test-time cost and long-term model growth.

What carries the argument

RaRecognize, a two-part learner consisting of a general minority-majority boundary kept dissimilar from specialized rare-subclass models.

If this is right

Only specialized models for rare subclasses are retained, so total model size grows only when new rare subclasses appear.
All instances labeled majority at test time are ignored, reducing computation on the large common class.
Both recurrent and previously unseen rare subclasses are handled without requiring the general boundary to be retrained on every new instance.
Performance gains are shown on three real-world document streams containing corporate-risk and disaster events as rare classes.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same separation of general and specialized learners could be tested on other streaming tasks where minority patterns evolve, such as fraud detection or sensor anomaly streams.
If the dissimilarity construction works as claimed, it offers a lightweight alternative to full replay-based continual learning methods that store all past data.
One could measure how much dissimilarity between the general and specialized components is actually achieved on real data to check whether the construction is the main driver of generalization.

Load-bearing premise

The construction that keeps the general boundary learner dissimilar from the specialized subclass learners is sufficient to prevent overfitting to seen rare instances and allow detection of truly new rare subclasses.

What would settle it

An experiment on a held-out stream containing a new rare subclass that is similar in feature distribution to the training majority class, where the method either labels the new subclass instances as majority or fails to mark them as emerging.

Figures

Figures reproduced from arXiv: 1906.12218 by Hung Nguyen, Leman Akoglu, Xuejian Wang.

**Figure 2.** Figure 2: Within- and cross-coverage rates of Vk’s and V0 (resp.) for Rk’s and R. (V1) Cyber attack (V2) Sexual assault (V3) Money laundering (V0) General risk [PITH_FULL_IMAGE:figures/full_fig_p006_2.png] view at source ↗

**Figure 3.** Figure 3: Wordclouds representing 3 example risk subclasses and the overall risk [PITH_FULL_IMAGE:figures/full_fig_p006_3.png] view at source ↗

**Figure 4.** Figure 4: Precision (seen), Recall (seen) and Recall (unseen) of methods on RiskDoc (RaRecognize-ICA achieves the best balance between Precision and Recall on both seen and unseen test instances). From [PITH_FULL_IMAGE:figures/full_fig_p013_4.png] view at source ↗

**Figure 5.** Figure 5: Word clouds for the weights of general and 3 specialized classifiers in [PITH_FULL_IMAGE:figures/full_fig_p014_5.png] view at source ↗

**Figure 7.** Figure 7: Time-Performance trade-off of all compared methods. early with the data size. RaRecognize with PCA is faster than that with ICA thanks to no feature correlations (i.e. x T [p] x[q] 2 dropped). In [PITH_FULL_IMAGE:figures/full_fig_p015_7.png] view at source ↗

**Figure 8.** Figure 8: Top-level classification: Precision (seen), Recall (seen) and Recall (unseen) of methods on Risk-Sen dataset. 0 0.2 0.4 0.6 0.8 SENCForest L2AC BASELINE BASELINE-r RARECOGNIZE-1K RARECOGNIZE-PCA RARECOGNIZE-ICA Precision (seen) Recall (seen) Recall (unseen) [PITH_FULL_IMAGE:figures/full_fig_p021_8.png] view at source ↗

**Figure 9.** Figure 9: Top-level classification: Precision (seen), Recall (seen) and Recall (unseen) of methods on NYT-Dstr dataset. A.4 Interpretability: Wordclouds for RaRecognize in all three datasets Risk-Doc, Risk-Sen and NYT-Dstr [PITH_FULL_IMAGE:figures/full_fig_p021_9.png] view at source ↗

**Figure 10.** Figure 10: Interpretability: Word clouds representing learned weights by general [PITH_FULL_IMAGE:figures/full_fig_p025_10.png] view at source ↗

**Figure 11.** Figure 11: Interpretability: Word clouds representing learned weights by general [PITH_FULL_IMAGE:figures/full_fig_p027_11.png] view at source ↗

**Figure 12.** Figure 12: Interpretability: Word clouds representing learned weights by general [PITH_FULL_IMAGE:figures/full_fig_p027_12.png] view at source ↗

read the original abstract

Given a labeled dataset that contains a rare (or minority) class of of-interest instances, as well as a large class of instances that are not of interest, how can we learn to recognize future of-interest instances over a continuous stream? We introduce RaRecognize, which (i) estimates a general decision boundary between the rare and the majority class, (ii) learns to recognize individual rare subclasses that exist within the training data, as well as (iii) flags instances from previously unseen rare subclasses as newly emerging. The learner in (i) is general in the sense that by construction it is dissimilar to the specialized learners in (ii), thus distinguishes minority from the majority without overly tuning to what is seen in the training data. Thanks to this generality, RaRecognize ignores all future instances that it labels as majority and recognizes the recurrent as well as emerging rare subclasses only. This saves effort at test time as well as ensures that the model size grows moderately over time as it only maintains specialized minority learners. Through extensive experiments, we show that RaRecognize outperforms state-of-the art baselines on three real-world datasets that contain corporate-risk and disaster documents as rare classes.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

RaRecognize separates a general rare-vs-majority boundary from per-subclass specialists to flag emerging rare classes in streams, but the claimed dissimilarity by construction has no visible enforcement mechanism.

read the letter

The paper's main move is to train one general boundary that separates the rare class from the majority, plus separate detectors for each known rare subclass, with the general boundary deliberately kept dissimilar from those specialists. This is supposed to let the general part catch new subclasses that never appeared in training while the system only keeps the minority specialists around, skipping most majority instances at test time. That setup directly targets the practical cost of continual monitoring on imbalanced streams like risk or disaster documents, and the model-size growth stays linear in the number of observed rare subclasses rather than exploding. The experiments report better numbers than baselines on three real datasets, which is the concrete evidence offered. The soft spot sits right at the central claim. The abstract states the general learner is dissimilar to the specialists by construction, yet supplies no loss term, architectural constraint, regularization, or training procedure that would actually produce that dissimilarity. If the two sets of models share features or optimization steps, the general boundary can still tune to the observed rare subclasses and lose its ability to generalize to unseen ones. Without seeing the exact mechanism or ablation that isolates the effect, the generality argument stays unverified. The experimental protocol details are also missing from the abstract, so it is hard to judge variance or whether the baselines were fairly matched. This work is aimed at researchers who already work on continual or imbalanced learning and need methods that stay efficient on streams. It is coherent enough on its own terms to deserve a serious referee who can check the implementation and the dissimilarity construction in detail.

Referee Report

1 major / 2 minor

Summary. The paper introduces RaRecognize for continual rare-class recognition in data streams. It (i) learns a general decision boundary separating the rare (minority) class from the majority class, (ii) trains specialized learners for known rare subclasses present in training data, and (iii) detects instances from previously unseen rare subclasses as emerging. The general boundary learner is asserted to be dissimilar to the specialized learners by construction, which is claimed to prevent over-tuning to observed training data and thereby enable generalization to new subclasses while keeping model growth moderate by ignoring majority instances at test time. The method is evaluated on three real-world datasets involving corporate-risk and disaster documents, where it reportedly outperforms state-of-the-art baselines.

Significance. If the dissimilarity mechanism can be made explicit and verified to support the claimed generality, the work addresses a practically relevant problem in continual learning under class imbalance, with potential efficiency gains from selective model expansion and test-time filtering. The experimental claim of outperformance on three datasets would be a concrete contribution if supported by full protocols and statistical detail.

major comments (1)

[Abstract] Abstract: The central claim that the general decision boundary learner is dissimilar to the specialized subclass learners 'by construction' (thereby enabling generalization to unseen subclasses) is asserted without any described architectural constraint, loss term, regularization, or verification procedure that would enforce or demonstrate the required dissimilarity. This mechanism is load-bearing for claims (i) and (iii) and cannot be assessed from the given description.

minor comments (2)

[Abstract] Abstract contains a repeated word ('of of-interest').
[Abstract] No mention of experimental protocol, error bars, statistical significance tests, or dataset characteristics (e.g., class imbalance ratios) is supplied in the abstract, making the outperformance claim difficult to interpret.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the detailed review and the recommendation for major revision. The primary concern raised is the lack of explicit description for the claimed dissimilarity between the general rare-vs-majority boundary learner and the specialized subclass learners. We address this point directly below and will revise the manuscript to make the mechanism fully explicit, including training differences and verification.

read point-by-point responses

Referee: [Abstract] Abstract: The central claim that the general decision boundary learner is dissimilar to the specialized subclass learners 'by construction' (thereby enabling generalization to unseen subclasses) is asserted without any described architectural constraint, loss term, regularization, or verification procedure that would enforce or demonstrate the required dissimilarity. This mechanism is load-bearing for claims (i) and (iii) and cannot be assessed from the given description.

Authors: We agree that the abstract alone does not provide sufficient detail on the dissimilarity mechanism, and this needs to be addressed for the claims to be fully assessable. In the method, the general boundary learner is trained with a binary cross-entropy loss using only rare-vs-majority labels (no subclass information), while each specialized learner uses a multi-class loss on the known subclass labels within the rare class. This difference in supervision and objective function enforces dissimilarity by construction: the general learner cannot overfit to subclass-specific patterns because it never sees subclass labels. We will expand the method section with a new subsection explicitly contrasting the two training procedures, add a short verification experiment (e.g., comparing decision boundaries or feature attributions) demonstrating the dissimilarity, and revise the abstract to include a one-sentence reference to this construction. These changes will also strengthen the justification for generalization to emerging subclasses. revision: yes

Circularity Check

0 steps flagged

No significant circularity; generality asserted by construction without reduction to fit or self-citation

full rationale

The abstract asserts that the general decision boundary learner is dissimilar to the specialized subclass learners 'by construction' and therefore generalizes to unseen subclasses, but supplies no equations, loss terms, architectural constraints, or self-citations that would make this dissimilarity reduce tautologically to the inputs or to a fitted parameter. No derivation chain is shown that equates the claimed generality to a post-hoc fit or renames an input as a prediction. The method description remains self-contained against external benchmarks, consistent with a minor (non-load-bearing) assertion rather than circular reasoning.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no explicit free parameters, axioms, or invented entities; the method is described at the level of high-level components without mathematical formulation.

pith-pipeline@v0.9.0 · 5735 in / 1132 out tokens · 27924 ms · 2026-05-25T13:34:14.411017+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

the third term in Eq. (2) penalizes w0 being correlated with any wk … reformulate the model correlation penalty as μ/2 ∑_{p,q} (w0,p wk,q x[p]ᵀx[q])² … self-correlation penalty … Theorem 1. The joint loss function L … remains convex.
IndisputableMonolith/Foundation/BranchSelection.lean branch_selection unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

RaRecognize … estimates a general decision boundary … by construction it is dissimilar to the specialized learners

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

19 extracted references · 19 canonical work pages · 1 internal anchor

[1]

Chen and B

Z. Chen and B. Liu. Lifelong machine learning.Synthesis Lectures on Artiﬁcial Intelligence and Machine Learning, 10(3):1–145, 2016

work page 2016
[2]

R. French. Catastrophic forgetting in connectionist networks.Trends in cognitive sciences, 3:128–135, 05 1999

work page 1999
[3]

Kemker and C

R. Kemker and C. Kanan. Fearnet: Brain-inspired model for incremental learning. In ICLR, 2018

work page 2018
[4]

Y. Kim. Convolutional neural networks for sentence classiﬁcation.EMNLP, 2014

work page 2014
[5]

Kirkpatrick, R

J. Kirkpatrick, R. Pascanu, N. Rabinowitz, J. Veness, G. Desjardins, A. A. Rusu, K. Milan, J. Quan, T. Ramalho, A. Grabska-Barwinska, et al. Overcoming catas- trophic forgetting in neural networks.PNAS, 114(13):3521–3526, 2017

work page 2017
[6]

Q. V. Le and T. Mikolov. Distributed representations of sentences and documents. In ICML, volume 14, pages 1188–1196, 2014

work page 2014
[7]

Lee, J.-H

S.-W. Lee, J.-H. Kim, J. Jun, J.-W. Ha, and B.-T. Zhang. Overcoming catastrophic forgetting by incremental moment matching. InNeurlPS, pages 4652–4662, 2017

work page 2017
[8]

X. Mu, K. M. Ting, and Z.-H. Zhou. Classiﬁcation under streaming emerging new classes: A solution using completely-random trees.IEEE TKDE, 29(8), 2017

work page 2017
[9]

X. Mu, F. Zhu, J. Du, E.-P. Lim, and Z.-H. Zhou. Streaming classiﬁcation with emerging new class by class matrix sketching. InAAAI, 2017

work page 2017
[10]

Rebuﬃ, A

S.-A. Rebuﬃ, A. Kolesnikov, G. Sperl, and C. H. Lampert. icarl: Incremental classiﬁer and representation learning. InCVPR, pages 2001–2010, 2017

work page 2001
[11]

H. Shin, J. K. Lee, J. Kim, and J. Kim. Continual learning with deep generative replay. InNeurlPS, pages 2990–2999, 2017. Continual Rare-Class Recognition with Emerging Novel Subclasses 17

work page 2017
[12]

L. Shu, H. Xu, and B. Liu. Doc: Deep open classiﬁcation of text documents. EMNLP, 2017

work page 2017
[13]

L. Shu, H. Xu, and B. Liu. Unseen class discovery in open-world classiﬁcation. arXiv preprint arXiv:1801.05609, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[14]

Anomalydetectioninstreams with extreme value theory

A.Siﬀer,P.-A.Fouque,A.Termier,andC.Largouet. Anomalydetectioninstreams with extreme value theory. InKDD, pages 1067–1075. ACM, 2017

work page 2017
[15]

H. Xu, B. Liu, L. Shu, and P. Yu. Open-world learning and application to product classiﬁcation. In WWW, 2019

work page 2019
[16]

Zhang, J

X. Zhang, J. Zhao, and Y. LeCun. Character-level convolutional networks for text classiﬁcation. In NeurlPS, pages 649–657, 2015. 18 Hung Nguyen Xuejian Wang Leman Akoglu A Supplementary A.1 Proof of Theorem 1. Given thatℓ(·)and L-pnorms forp≥ 1are convex, and that sum of non-negative convex functions remains convex, it suﬃces to show that the correlation ...

work page 2015
[17]

(8) that do not depend onw0

∂C ∂w0∂w0 : ∂C ∂w0,z∂w0,t = ∂ ∂w0,z∂w0,t ∑ p,q {µ 4 (w2 0,pw2 0,q) ( xT [p]x[q] )2    Z1 +µ 2 K∑ k′=1 (w2 0,pw2 k′,q) ( xT [p]x[q] )2    Z2 } (9) which excludes the terms in Eq. (8) that do not depend onw0. ∂Z1 ∂w0,z∂w0,t =    2µw 0,zw0,t (xT [z]x[t])2 if t̸=z 2µw 2 0,z (xT [z]x[z])2 +µ ∑ qw2 0,q (xT [z]x[q])2 if t =z    (10) Continual Rare-...

work page
[18]

(8) that do not depend onwk

∂C ∂wk∂wk : ∂C ∂wk,z∂wk,t = ∂ ∂wk,z∂wk,t ∑ p,q {µ 4 (w2 k,pw2 k,q) ( xT [p]x[q] )2    K1 +µ 2 (w2 0,pw2 k,q) ( xT [p]x[q] )2    K2 } (13) which excludes the terms in Eq. (8) that do not depend onwk. ∂K1 ∂wk,z∂wk,t =    2µw k,zwk,t (xT [z]x[t])2 if t̸=z 2µw 2 k,z (xT [z]x[z])2 +µ ∑ qw2 k,q (xT [z]x[q])2 if t =z    (14) ∂K2 ∂wk,z∂wk,t =    ...

work page
[19]

(8) that do not depend on bothw0 and wk

∂C ∂w0∂wk : ∂C ∂w0,z∂wk,t = ∂ ∂w0,z∂wk,t ∑ p,q {µ 2 (w2 0,pw2 k,q) ( xT [p]x[q] )2    T } (17) which excludes the terms in Eq. (8) that do not depend on bothw0 and wk. ∂T ∂w0,z∂wk,t = 2µw 0,zwk,t (xT [z]x[t])2⇒ ∂C ∂w0∂wk = 2µ [ w0wT k⊙ G⊙ G ] (18) Let us denoteD0 = diag(v(0) 1 ,...,v (0) d ), Dk = diag(v(k) 1 ,...,v (k) d ), and G2 = G⊙ G. Then the He...

work page arXiv 1966

[1] [1]

Chen and B

Z. Chen and B. Liu. Lifelong machine learning.Synthesis Lectures on Artiﬁcial Intelligence and Machine Learning, 10(3):1–145, 2016

work page 2016

[2] [2]

R. French. Catastrophic forgetting in connectionist networks.Trends in cognitive sciences, 3:128–135, 05 1999

work page 1999

[3] [3]

Kemker and C

R. Kemker and C. Kanan. Fearnet: Brain-inspired model for incremental learning. In ICLR, 2018

work page 2018

[4] [4]

Y. Kim. Convolutional neural networks for sentence classiﬁcation.EMNLP, 2014

work page 2014

[5] [5]

Kirkpatrick, R

J. Kirkpatrick, R. Pascanu, N. Rabinowitz, J. Veness, G. Desjardins, A. A. Rusu, K. Milan, J. Quan, T. Ramalho, A. Grabska-Barwinska, et al. Overcoming catas- trophic forgetting in neural networks.PNAS, 114(13):3521–3526, 2017

work page 2017

[6] [6]

Q. V. Le and T. Mikolov. Distributed representations of sentences and documents. In ICML, volume 14, pages 1188–1196, 2014

work page 2014

[7] [7]

Lee, J.-H

S.-W. Lee, J.-H. Kim, J. Jun, J.-W. Ha, and B.-T. Zhang. Overcoming catastrophic forgetting by incremental moment matching. InNeurlPS, pages 4652–4662, 2017

work page 2017

[8] [8]

X. Mu, K. M. Ting, and Z.-H. Zhou. Classiﬁcation under streaming emerging new classes: A solution using completely-random trees.IEEE TKDE, 29(8), 2017

work page 2017

[9] [9]

X. Mu, F. Zhu, J. Du, E.-P. Lim, and Z.-H. Zhou. Streaming classiﬁcation with emerging new class by class matrix sketching. InAAAI, 2017

work page 2017

[10] [10]

Rebuﬃ, A

S.-A. Rebuﬃ, A. Kolesnikov, G. Sperl, and C. H. Lampert. icarl: Incremental classiﬁer and representation learning. InCVPR, pages 2001–2010, 2017

work page 2001

[11] [11]

H. Shin, J. K. Lee, J. Kim, and J. Kim. Continual learning with deep generative replay. InNeurlPS, pages 2990–2999, 2017. Continual Rare-Class Recognition with Emerging Novel Subclasses 17

work page 2017

[12] [12]

L. Shu, H. Xu, and B. Liu. Doc: Deep open classiﬁcation of text documents. EMNLP, 2017

work page 2017

[13] [13]

L. Shu, H. Xu, and B. Liu. Unseen class discovery in open-world classiﬁcation. arXiv preprint arXiv:1801.05609, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018

[14] [14]

Anomalydetectioninstreams with extreme value theory

A.Siﬀer,P.-A.Fouque,A.Termier,andC.Largouet. Anomalydetectioninstreams with extreme value theory. InKDD, pages 1067–1075. ACM, 2017

work page 2017

[15] [15]

H. Xu, B. Liu, L. Shu, and P. Yu. Open-world learning and application to product classiﬁcation. In WWW, 2019

work page 2019

[16] [16]

Zhang, J

X. Zhang, J. Zhao, and Y. LeCun. Character-level convolutional networks for text classiﬁcation. In NeurlPS, pages 649–657, 2015. 18 Hung Nguyen Xuejian Wang Leman Akoglu A Supplementary A.1 Proof of Theorem 1. Given thatℓ(·)and L-pnorms forp≥ 1are convex, and that sum of non-negative convex functions remains convex, it suﬃces to show that the correlation ...

work page 2015

[17] [17]

(8) that do not depend onw0

∂C ∂w0∂w0 : ∂C ∂w0,z∂w0,t = ∂ ∂w0,z∂w0,t ∑ p,q {µ 4 (w2 0,pw2 0,q) ( xT [p]x[q] )2    Z1 +µ 2 K∑ k′=1 (w2 0,pw2 k′,q) ( xT [p]x[q] )2    Z2 } (9) which excludes the terms in Eq. (8) that do not depend onw0. ∂Z1 ∂w0,z∂w0,t =    2µw 0,zw0,t (xT [z]x[t])2 if t̸=z 2µw 2 0,z (xT [z]x[z])2 +µ ∑ qw2 0,q (xT [z]x[q])2 if t =z    (10) Continual Rare-...

work page

[18] [18]

(8) that do not depend onwk

∂C ∂wk∂wk : ∂C ∂wk,z∂wk,t = ∂ ∂wk,z∂wk,t ∑ p,q {µ 4 (w2 k,pw2 k,q) ( xT [p]x[q] )2    K1 +µ 2 (w2 0,pw2 k,q) ( xT [p]x[q] )2    K2 } (13) which excludes the terms in Eq. (8) that do not depend onwk. ∂K1 ∂wk,z∂wk,t =    2µw k,zwk,t (xT [z]x[t])2 if t̸=z 2µw 2 k,z (xT [z]x[z])2 +µ ∑ qw2 k,q (xT [z]x[q])2 if t =z    (14) ∂K2 ∂wk,z∂wk,t =    ...

work page

[19] [19]

(8) that do not depend on bothw0 and wk

∂C ∂w0∂wk : ∂C ∂w0,z∂wk,t = ∂ ∂w0,z∂wk,t ∑ p,q {µ 2 (w2 0,pw2 k,q) ( xT [p]x[q] )2    T } (17) which excludes the terms in Eq. (8) that do not depend on bothw0 and wk. ∂T ∂w0,z∂wk,t = 2µw 0,zwk,t (xT [z]x[t])2⇒ ∂C ∂w0∂wk = 2µ [ w0wT k⊙ G⊙ G ] (18) Let us denoteD0 = diag(v(0) 1 ,...,v (0) d ), Dk = diag(v(k) 1 ,...,v (k) d ), and G2 = G⊙ G. Then the He...

work page arXiv 1966