pith. machine review for the scientific record.

arxiv: 2605.11742 · v1 · submitted 2026-05-12 · 💻 cs.LG

Recognition: 1 theorem link · Lean Theorem

Online Continual Learning with Dynamic Label Hierarchies

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 06:30 UTC · model grok-4.3

classification 💻 cs.LG
keywords: online continual learning · dynamic label hierarchies · hierarchical prototypes · catastrophic forgetting · mixed granularity supervision · adaptive classification heads · taxonomy evolution

The pith

Organized learnable hierarchical prototypes regularize adaptive classification heads to handle evolving label taxonomies in online continual learning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper defines a new setting, DHOCL, in which online continual learning streams have label hierarchies that change both horizontally across siblings and vertically across granularities, while each sample supplies supervision at only one level. This mixed-granularity input creates partial signals that limit plasticity and break cross-level consistency, while the changing structure also produces granularity-specific interference that destabilizes standard replay and regularization. HALO counters both problems by adaptively combining complementary classification heads and anchoring them with organized learnable hierarchical prototypes that enforce structure during updates. The prototypes thereby support rapid adaptation and mitigate catastrophic forgetting as the taxonomy shifts. Experiments across benchmarks show higher hierarchical accuracy, lower mistake severity, and stronger overall continual performance.
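
To make the setting concrete, the following toy sketch (illustrative only; the class names, schedule, and structure are invented, not taken from the paper) shows a stream whose taxonomy grows both horizontally and vertically while each sample carries a label at exactly one level:

    import random

    # Toy two-level taxonomy that grows over time: coarse class -> fine classes.
    taxonomy = {"vehicle": ["car"], "animal": ["dog"]}

    def evolve(taxonomy, step):
        if step == 50:
            taxonomy["vehicle"].append("plane")  # horizontal: new sibling class
        if step == 120:
            taxonomy["bird"] = ["sparrow"]       # a new coarse branch appears

    def sample(taxonomy):
        coarse = random.choice(list(taxonomy))
        fine = random.choice(taxonomy[coarse])
        # Mixed-granularity supervision: a label at ONE level only, never the path.
        level = random.choice(["coarse", "fine"])
        return (coarse if level == "coarse" else fine), level

    for step in range(200):
        evolve(taxonomy, step)
        label, level = sample(taxonomy)  # the learner sees (x, label, level)

A point-wise signal like ("car", fine) says nothing directly about "vehicle", which is what makes cross-level consistency hard to maintain under this stream.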

Core claim

HALO adaptively combines complementary classification heads, regularized by organized learnable hierarchical prototypes, enabling rapid adaptation, hierarchical consistency, and structured knowledge consolidation as the taxonomy evolves.
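
The combination rule itself is not quoted on this page. As a generic stand-in, a confidence-weighted merge of two heads might look like the sketch below; the paper pairs linear heads with analytic heads and aggregates calibrated predictions per level via PredLA, so the uncalibrated weighting here is an assumption, not the paper's method.

    import torch

    def combine_heads(logits_linear, logits_analytic):
        # Hypothetical merge rule: weight each head by its own prediction
        # confidence (max softmax probability), then mix the distributions.
        p_lin = torch.softmax(logits_linear, dim=-1)
        p_ana = torch.softmax(logits_analytic, dim=-1)
        w_lin = p_lin.max(dim=-1, keepdim=True).values
        w_ana = p_ana.max(dim=-1, keepdim=True).values
        w = torch.cat([w_lin, w_ana], dim=-1)
        w = w / w.sum(dim=-1, keepdim=True)  # normalize the two head weights
        return w[..., :1] * p_lin + w[..., 1:] * p_ana

    probs = combine_heads(torch.randn(4, 10), torch.randn(4, 10))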

What carries the argument

Organized learnable hierarchical prototypes that regularize the adaptively combined classification heads to preserve structure across changing granularities.
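
A minimal sketch of what this regularization could look like, assuming one learnable prototype per class and a known parent map (the paper instead maintains per-class prototype banks with attention maps, which this simplifies away; names and dimensions are invented):

    import torch
    import torch.nn.functional as F

    classes = ["vehicle", "animal", "car", "plane", "dog"]
    parent = {"car": "vehicle", "plane": "vehicle", "dog": "animal"}
    idx = {c: i for i, c in enumerate(classes)}
    protos = torch.nn.Parameter(torch.randn(len(classes), 64))

    def hpr_loss(protos, prev_protos, margin=0.5, lam=1.0):
        # (1) Hierarchical alignment: pull each child toward its parent so
        #     ancestor-descendant prototype distances stay small.
        align = sum(
            F.relu(torch.dist(protos[idx[c]], protos[idx[p]]) - margin)
            for c, p in parent.items()
        ) / len(parent)
        # (2) Step-to-step consistency: keep prototypes near their previous
        #     values so learned structure survives taxonomy shifts.
        return align + lam * F.mse_loss(protos, prev_protos)

    loss = hpr_loss(protos, protos.detach().clone())
    loss.backward()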

If this is right

  • Partial supervision at single hierarchy levels no longer limits plasticity or cross-level consistency.
  • Granularity-dependent interference is reduced, stabilizing replay buffers and regularization terms.
  • Knowledge consolidates in a structured manner that tracks taxonomy changes rather than being overwritten.
  • Hierarchical accuracy rises and mistake severity falls relative to methods that ignore hierarchy dynamics.
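
The last prediction is directly measurable. Mistake severity is conventionally scored by how far up the tree the lowest common ancestor of the predicted and true classes sits (e.g., Bertinetto et al., 2020); the paper's exact definition is not quoted above, so the sketch below assumes that convention on an invented toy tree:

    # Mistake severity as the number of tree levels between the true class and
    # its lowest common ancestor with the prediction (0 = correct).
    parent = {"car": "vehicle", "plane": "vehicle", "dog": "animal",
              "vehicle": "root", "animal": "root"}

    def ancestors(node):
        path = [node]
        while node in parent:
            node = parent[node]
            path.append(node)
        return path

    def severity(pred, true):
        pred_anc = set(ancestors(pred))
        for height, node in enumerate(ancestors(true)):
            if node in pred_anc:
                return height
        return len(ancestors(true))

    print(severity("car", "car"))    # 0: correct
    print(severity("car", "plane"))  # 1: siblings under "vehicle"
    print(severity("car", "dog"))    # 2: nearest shared node is "root"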

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same prototype organization could transfer to other continual settings where concepts carry implicit hierarchies, such as evolving product catalogs or medical diagnosis codes.
  • Visual inspection of the learned prototypes might reveal how the model reorganizes knowledge when a new level is introduced.
  • The method suggests that explicit hierarchy modeling should be default rather than optional for real-world lifelong learning systems.
  • Testing the prototypes on non-image modalities with naturally changing taxonomies would clarify whether the benefit is modality-specific.

Load-bearing premise

Dynamically evolving hierarchies can be captured and regularized through organized learnable prototypes without introducing new interference or needing per-evolution hyperparameter tuning.

What would settle it

If HALO shows no gain in hierarchical accuracy or higher forgetting rates than flat-label baselines on streams where label granularity shifts frequently and at irregular intervals, the central claim would be falsified.
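
Settling it also needs an agreed forgetting measure. A standard choice in continual learning is the drop from a group's best past accuracy to its final accuracy; the sketch below uses synthetic numbers and assumes that definition, since the paper's exact evaluation grid is not quoted here:

    import numpy as np

    # acc[t, k] = accuracy on granularity-group k after training step t
    # (synthetic numbers for illustration only).
    acc = np.array([[0.80, 0.00],
                    [0.65, 0.70],
                    [0.55, 0.72]])

    def forgetting(acc):
        # Best past accuracy minus final accuracy, averaged over groups.
        best_past = acc[:-1].max(axis=0)
        return float(np.mean(best_past - acc[-1]))

    print(forgetting(acc))  # (0.80-0.55 + 0.70-0.72)/2 = 0.115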

Figures

Figures reproduced from arXiv: 2605.11742 by Alexandra Gomez-Villa, Bartłomiej Twardowski, Shao-Yuan Li, Songcan Chen, Xinrui Wang.

Figure 1. Hierarchy formation in three existing settings (colors denote classes and ellipses indicate time continuation). Top: OCL (Koh et al., 2022) uses flat labels only, yielding a flat final label space. Middle: HLE and IIRC (Lee et al., 2023; Abdelsalam et al., 2021) operate under a strict coarse-to-fine curriculum, in which parent classes must be introduced before their descendants, yielding a predefined and f…

Figure 2. Performance comparison under DHOCL on CIFAR-100 (Krizhevsky et al., 2009), FGVC-Aircraft (Maji et al., 2013), CUB-200 (Wah et al., 2011), and iNaturalist (Van Horn et al., 2018) (from left to right). All methods employ reservoir sampling-based memory buffers with sizes of 1000, 1000, 1000, and 5000, respectively. Evaluation is across five metrics capturing overall performance, fine-grained accuracy, and mi…

Figure 3. Performance comparison of four anti-forgetting strategies under DHOCL on CIFAR-100 (left) and iNaturalist (right). [Line plots of fine-grained accuracy vs. iteration; methods: RS, EWC++, LwF, ACIL, SDC.]

Figure 4. Learning dynamics of different methods on CIFAR-100 (left) and iNaturalist (right) under DHOCL. Performance is evaluated on fine-grained accuracy, which reflects forgetting resistance.

Figure 5. Framework overview. Left: a trainable backbone f and a frozen backbone f0 are followed by linear and analytic heads. PredLA aggregates calibrated predictions per level. Bottom: an add-on adapter fA maps features to a prototype space M. Right: HPR maintains class-specific prototype banks to form attention maps, enforcing consistency between consecutive steps and hierarchical alignment by pulling ancestor–de…

Figure 6. Normalized distance matrices illustrating the hierarchical relations among class labels and their prototypes. From left to right, the matrices are (i) distances derived from the ground-truth label hierarchy based on the normalized LCA (Lowest Common Ancestor) depth, (ii) distances between hierarchical prototypes obtained without hierarchical prototype regularization (HPR), and (iii) distances between hiera…

Figure 7. Sensitivity analysis on λ and δ.

Figure 8. Learning dynamics of different methods on CIFAR-100, FGVC-Aircraft, CUB, and iNaturalist under DHOCL. Performance is evaluated on fine-grained accuracy, which reflects forgetting resistance.

Figure 9. Sensitivity analysis on λ, δ, γ, |P^h_c|, K, and m.

Figure 10. Comparison of memory sampling strategies on CIFAR-100 (left) and iNaturalist (right) under hierarchical label streams. Performance is evaluated by fine-grained accuracy, which best reflects forgetting resistance.

Figure 11. Model throughput of different methods under DHOCL on CIFAR-100, measured as the number of samples processed per second during training.

Figure 12. Growing label hierarchies for CIFAR-100. The left, middle, and right subfigures show the label hierarchies after confronting 10, 50, and 100 (all) fine-grained classes.

Figure 13. Normalized distance matrices illustrating the hierarchical relations among class labels and their prototypes. From left to right, the matrices are (i) distances derived from the ground-truth label hierarchy based on the normalized LCA (Lowest Common Ancestor) depth, (ii) distances between hierarchical prototypes obtained without hierarchical prototype regularization (HPR), and (iii) distances between hier…

Figure 14. Difference matrices between the learned prototype hierarchies and the ground-truth label hierarchy. Each matrix visualizes the deviation from the ground-truth normalized LCA-based distances. The left and right columns correspond to the results obtained with and without HPR. As shown, incorporating HPR notably reduces these discrepancies, providing direct empirical evidence that the proposed module effecti…
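
Figures 6, 13, and 14 all hinge on a ground-truth distance matrix built from normalized LCA depth. One plausible construction on an invented toy tree (the figures use the benchmarks' real label hierarchies, and the paper's exact normalization may differ):

    import numpy as np

    parent = {"car": "vehicle", "plane": "vehicle", "dog": "animal",
              "vehicle": "root", "animal": "root"}
    leaves = ["car", "plane", "dog"]

    def depth(node):
        d = 0
        while node in parent:
            node, d = parent[node], d + 1
        return d

    def lca(a, b):
        anc = {a}
        while a in parent:
            a = parent[a]
            anc.add(a)
        while b not in anc:
            b = parent[b]
        return b

    # Deeper shared ancestor -> semantically closer -> smaller distance.
    max_depth = max(depth(c) for c in leaves)
    D = np.array([[1 - depth(lca(a, b)) / max_depth for b in leaves]
                  for a in leaves])
    print(np.round(D, 2))  # 0 on the diagonal, 0.5 for siblings, 1.0 across branches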
read the original abstract

Online Continual Learning (OCL) aims to learn from endless non-stationary data streams, yet most existing methods assume a flat label space and overlook the hierarchical organization of real-world concepts that evolves both horizontally (sibling classes) and vertically (coarse or fine categories). To better reflect this context, we introduce a new problem setting, DHOCL (Online Continual Learning from Dynamic Hierarchies), where taxonomies evolve across granularities and each sample provides supervision at a single hierarchical level. In this setting, we find two fundamental issues: (i) partial supervision under mixed granularities provides only point-wise signals over an evolving path-wise hierarchy, which constrains plasticity and undermines cross-level semantic consistency, and (ii) the dynamically evolving hierarchies induce granularity-dependent interference, destabilizing popular replay and regularization mechanisms and thereby exacerbating catastrophic forgetting. To tackle these issues, we propose HALO (Hierarchical Adaptive Learning with Organized Prototypes), which adaptively combines complementary classification heads, regularized by organized learnable hierarchical prototypes, enabling rapid adaptation, hierarchical consistency, and structured knowledge consolidation as the taxonomy evolves. Extensive experiments on multiple benchmarks demonstrate that HALO consistently outperforms existing methods across hierarchical accuracy, mistake severity, and continual performance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

0 major / 2 minor

Summary. The paper introduces the DHOCL setting for online continual learning under dynamically evolving label hierarchies, where each sample receives supervision at only one granularity level. It identifies two core issues—partial supervision constraining plasticity and cross-level consistency, plus granularity-dependent interference destabilizing replay and regularization—and proposes HALO, which adaptively combines complementary classification heads regularized by organized learnable hierarchical prototypes to support rapid adaptation, hierarchical consistency, and structured consolidation. Experiments on multiple benchmarks report consistent gains in hierarchical accuracy, mistake severity, and continual performance metrics.

Significance. If the results hold under controlled hierarchy evolution schedules, the work is significant for moving OCL beyond flat label spaces toward more realistic evolving taxonomies. The explicit tying of prototype organization to cross-level consistency losses and the use of adaptive heads provide a coherent mechanism for handling partial supervision without requiring oracle taxonomies or bounded evolution rates. This could serve as a useful baseline for future research on hierarchical continual learning.

minor comments (2)
  1. The abstract states consistent outperformance but omits details on experimental controls, error bars, and exact hyperparameter settings for hierarchy evolution; adding these in the main text or appendix would strengthen verifiability.
  2. Notation for the organized prototypes and cross-level consistency losses could be made more explicit (e.g., by defining the prototype organization matrix in a dedicated equation) to aid reproducibility.
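
For concreteness, one form such a dedicated equation could take, in invented notation: the margin δ and weight λ do appear in the paper's sensitivity analyses (Figures 7 and 9), but the functional form below is an editorial assumption, not the paper's.

    \mathcal{L}_{\mathrm{HPR}}
      \;=\; \sum_{(c,\,c') \in \mathcal{E}_t} \max\bigl(0,\; \lVert p_c - p_{c'} \rVert_2 - \delta \bigr)
      \;+\; \lambda \sum_{c \in \mathcal{C}_t} \bigl\lVert p_c - p_c^{(t-1)} \bigr\rVert_2^2

with \mathcal{E}_t the ancestor–descendant pairs in the current taxonomy, \mathcal{C}_t the current class set, and p_c the prototype of class c.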

Simulated Authors' Rebuttal

0 responses · 0 unresolved

We thank the referee for their encouraging summary and for recognizing the potential significance of introducing the DHOCL setting and the HALO method for handling dynamically evolving label hierarchies in online continual learning. We appreciate the assessment that the work provides a coherent mechanism for partial supervision and could serve as a baseline for future research.

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper introduces DHOCL as a new problem setting and proposes HALO via explicit architectural components (adaptive heads + organized prototypes) and loss terms for cross-level consistency. These are defined directly in the method without reducing any central claim to a fitted parameter renamed as prediction, a self-citation chain, or an imported uniqueness theorem. The claims remain testable against external benchmarks and controlled hierarchy-evolution experiments; no load-bearing step collapses by construction into its inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach assumes standard continual learning mechanisms can be extended with hierarchical prototypes; no explicit free parameters or invented entities are detailed in the abstract, but the regularization likely involves learned components fitted during training.

axioms (1)
  • domain assumption: Non-stationary data streams with evolving taxonomies provide supervision at single hierarchical levels.
    Core to the DHOCL definition in the abstract.

pith-pipeline@v0.9.0 · 5536 in / 1162 out tokens · 56707 ms · 2026-05-13T06:30:47.931956+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

16 extracted references · 16 canonical work pages · 2 internal anchors

  1. [1]

    GPT-4 Technical Report

    Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F. L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.

  2. [2]

    On Tiny Episodic Memories in Continual Learning

    Chaudhry, A., Rohrbach, M., Elhoseiny, M., Ajanthan, T., Dokania, P. K., Torr, P. H., and Ranzato, M. On tiny episodic memories in continual learning. arXiv preprint arXiv:1902.10486, 2019.

  3. [3]

    Online Continual Learning from Imbalanced Data

    Chrysakis, A. and Moens, M.-F. Online continual learning from imbalanced data. In International Conference on Machine Learning, pp. 1952–1961. PMLR, 2020.

  4. [4]

    The Lie of the Average: How Class Incremental Learning Evaluation Deceives You?

    Lai, G., Zhou, D.-W., Yang, X., and Ye, H.-J. The lie of the average: How class incremental learning evaluation deceives you? arXiv preprint arXiv:2509.22580, 2025.

  5. [5]

    Fine-Grained Visual Classification of Aircraft

    Maji, S., Rahtu, E., Kannala, J., Blaschko, M., and Vedaldi, A. Fine-grained visual classification of aircraft. arXiv preprint arXiv:1306.5151, 2013.

  6. [6]

    iCaRL: Incremental Classifier and Representation Learning

    Rebuffi, S.-A., Kolesnikov, A., Sperl, G., and Lampert, C. H. iCaRL: Incremental classifier and representation learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2001–2010, 2017.

  7. [7]

    Caltech-UCSD Birds-200-2011

    Wah, C., Branson, S., Welinder, P., Perona, P., and Belongie, S. Caltech-UCSD Birds-200-2011. Technical Report CNS-TR-2011-001, California Institute of Technology, 2011.

  8. [8]

    Improving Plasticity in Online Continual Learning via Collaborative Learning

    Wang, M., Michel, N., Xiao, L., and Yamasaki, T. Improving plasticity in online continual learning via collaborative learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 23460–23469, 2024.

  9. [9]

    A Model or 603 Exemplars: Towards Memory-Efficient Class-Incremental Learning

    Zhou, D.-W., Wang, Q.-W., Ye, H.-J., and Zhan, D.-C. A model or 603 exemplars: Towards memory-efficient class-incremental learning. arXiv preprint arXiv:2205.13218, 2022.

  10. [10]

    GACL: Exemplar-Free Generalized Analytic Continual Learning

    Zhuang, H., Chen, Y., Fang, D., He, R., Tong, K., Wei, H., Zeng, Z., and Chen, C. GACL: Exemplar-free generalized analytic continual learning. Advances in Neural Information Processing Systems, 37:83024–83047, 2024.

  11. [11]

    (1) Embedding-based methods that encode hierarchical relationships through specific label or feature embeddings (these embeddings can typically be perceived as class-mean vectors) derived from hierarchical distances or predefined semantics, then align visual features with these embeddings; (2) Loss-based methods that design hierarchy-aware loss functions, such as hierarchical...

  12. [12]

    ...or optimal transport-based losses (Yang et al., 2018; Yurochkin et al., 2019), to penalize misclassifications based on their semantic distances in the hierarchy; and (3) Architecture-based methods (Liang & Davis, 2023; Wu et al., 2016; Liang et al., ...

  13. [13]

    While these methods effectively exploit hierarchical label structures, they are primarily designed for static settings with fixed label hierarchies

    ...that employ multi-level classification heads or dynamic network structures to handle different granularities (Wu et al., 2016; Liang et al., 2018; Chang et al., 2021; Lu et al., 2025). While these methods effectively exploit hierarchical label structures, they are primarily designed for static settings with fixed label hierarchies. Recent attempts to exten...

  14. [14]

    Training of linear heads. Training the linear heads is straightforward: during training we directly optimize both the linear heads and the feature extractor with the cross-entropy loss. Given a mini-batch of features and one-hot labels, Φ = f(X) ∈ ℝ^{d×b} and Y ∈ ℝ^{C×b}, we minimize the following cross-entr...

  15. [15]

    Table 6. Performance comparison of different methods under a DHOCL stream constructed via coarse-grained partitioning

    Algorithm 1: Training procedure of HALO. 1: Input: data stream D_t; updated label tree T_t; current backbone f_t; prototype adapter f_A; prototype banks {P}; replay buffer M. 2: Initialize prototypes for new classes; cache pretrained backbone as f_0. 3: for (x, y) from D_t and (x_m, y_m) from M do 4: Complete the stream label y and the memory label y_m to ỹ, ỹ_m along ancesto...

  16. [16]

    For this counter-intuitive finding, we give a more detailed analysis in Sec. D. C.2. Results on Imbalanced Hierarchy: ImageNet-H. In this section, we report results on ImageNet-H, which has an imbalanced hierarchical structure. We construct two DHOCL streams by partitioning the fine-grained classes into 10 and 20 groups, respectively, following the proc...