pith. machine review for the scientific record.

arxiv: 2605.11710 · v1 · submitted 2026-05-12 · 💻 cs.LG · cs.CV

Recognition: no theorem link

Unlocking Compositional Generalization in Continual Few-Shot Learning

Chi-Nguyen Tran, Dao Sy Duy Minh, Huynh Trung Kiet, Long Tran-Thanh, Phu-Hoa Pham, Phu-Quy Nguyen-Lam

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 07:10 UTC · model grok-4.3

classification 💻 cs.LG cs.CV
keywords continual few-shot learning · compositional generalization · object-centric representations · slot representations · vision transformers · dual-phase strategy · unseen concept generalization · catastrophic forgetting

The pith

By optimizing slot representations for holistic class identity in training and composing them at inference, the framework achieves strong generalization to novel concepts with minimal forgetting.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper identifies a core conflict in continual few-shot learning: models that collapse scenes into single embeddings lose object details, while those using part-level matching during training tie features too tightly to patterns seen so far. It proposes a strict separation where self-supervised Vision Transformer patches are turned into slots optimized only for overall class identity, keeping their geometries general. At test time those preserved slots are assembled on the fly to fit new scenes. If this separation works, models could keep learning new tasks without erasing their ability to recognize entirely new objects from few examples.

Core claim

The authors claim that a dual-phase strategy strictly decouples representation learning from compositional inference: during training, slot representations are optimized entirely toward holistic class identity using the patch-level geometry of a frozen self-supervised Vision Transformer backbone, which preserves generalizable object-level features; at inference, these slots are dynamically composed to match novel scenes, yielding state-of-the-art performance on unseen-concept generalization and minimal forgetting across standard continual learning benchmarks.
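
The figure caption further down describes the trainable pieces as an MLP router and a projection head producing unit-norm embeddings trained with cross-entropy. As a rough illustration of what such a Phase I step could look like, here is a minimal sketch; the names (phase1_step, router_mlp, class_prototypes, and so on) are assumptions for exposition, not the authors' code, and the cross-correlation penalty mentioned in the caption is omitted for brevity.

```python
# Minimal sketch of a Phase I training step, assuming a frozen self-supervised
# ViT and slot-attention module, with only a small MLP router, projection head,
# and class prototypes being trained toward holistic class identity.
import torch
import torch.nn.functional as F

def phase1_step(frozen_vit, slot_attention, router_mlp, proj_head,
                class_prototypes, images, labels, optimizer):
    with torch.no_grad():                        # backbone and slot attention stay frozen
        patches = frozen_vit(images)             # (B, N, D) patch tokens
        slots = slot_attention(patches)          # (B, K, D) object-centric slots

    agg = router_mlp(slots).mean(dim=1)          # aggregate slots into one scene vector
    z = F.normalize(proj_head(agg), dim=-1)      # unit-norm embedding
    protos = F.normalize(class_prototypes, dim=-1)
    logits = z @ protos.t()                      # (B, C) cosine logits

    loss = F.cross_entropy(logits, labels)       # holistic class-identity objective only
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Note that only the head parameters (and, here, the prototypes) receive gradients; the slot geometries themselves are never pushed toward seen part patterns, which is the point of the decoupling.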

What carries the argument

The dual-phase strategy that decouples holistic class-identity optimization during training from dynamic slot composition at inference, using preserved patch-level semantic geometry of self-supervised Vision Transformers.

If this is right

  • The frozen backbone prevents representation drift across sequential tasks.
  • Lightweight holistic optimization during training keeps the features usable for entirely new concepts.
  • Dynamic composition of preserved slots at inference allows matching of novel scenes without retraining the backbone.
  • The approach yields state-of-the-art unseen-concept generalization together with minimal catastrophic forgetting on standard continual few-shot benchmarks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same decoupling might apply to other self-supervised backbones if they exhibit comparable geometric structure in their internal representations.
  • Avoiding part-level objectives during training could become a general design rule for any continual learner that needs to stay composable over long task sequences.
  • The method suggests that explicit replay buffers could be reduced or removed if the preserved slot geometries already carry enough information for new compositions.
  • Testing the same split on non-vision modalities would clarify whether the benefit stems from the ViT geometry itself or from the training-inference separation.

Load-bearing premise

The patch-level semantic geometry inside self-supervised Vision Transformers stays generalizable when slots are optimized only for overall class identity rather than being tied to specific seen patterns.

What would settle it

A controlled comparison on the same benchmarks in which the dual-phase model is replaced by one that either uses global embeddings or applies part-level matching objectives during training; if the gap in novel-concept accuracy and forgetting disappears, the claimed benefit of the decoupling does not hold.
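
To make that comparison concrete, one hypothetical instantiation of the two baseline objectives is sketched below; neither is taken from the paper, and the part-level loss shown is only one plausible reading of "part-level matching".

```python
# Hypothetical baseline objectives for the ablation sketched above. The first
# collapses the scene into a single global embedding; the second ties each slot
# to the nearest class-specific part prototype during training, the kind of
# objective the paper argues over-commits to seen patterns.
import torch
import torch.nn.functional as F

def global_embedding_loss(patches, proj_head, class_prototypes, labels):
    # Baseline 1: one pooled embedding per scene, no slots at all.
    z = F.normalize(proj_head(patches.mean(dim=1)), dim=-1)
    return F.cross_entropy(z @ class_prototypes.t(), labels)

def part_level_matching_loss(slots, part_prototypes, labels):
    # Baseline 2: nearest-part matching during training.
    # slots: (B, K, D); part_prototypes: dict class_id -> (P, D)
    losses = []
    for b in range(slots.size(0)):
        protos = part_prototypes[int(labels[b])]       # parts of this sample's class
        d = torch.cdist(slots[b], protos)              # (K, P) pairwise distances
        losses.append(d.min(dim=1).values.mean())
    return torch.stack(losses).mean()
```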

Figures

Figures reproduced from arXiv: 2605.11710 by Chi-Nguyen Tran, Dao Sy Duy Minh, Huynh Trung Kiet, Long Tran-Thanh, Phu-Hoa Pham, Phu-Quy Nguyen-Lam.

Figure 1. Top (Phase I): A frozen ViT and slot attention process images. A trainable MLP router and projection head map raw aggregates to unit-norm embeddings, optimized via cross-entropy and a cross-correlation penalty. Bottom (Phase II): Gradient-free inference composes novel scenes by centering and matching slots via bidirectional Chamfer distance. By freezing the backbone and slot attention…
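
A minimal sketch of the gradient-free Phase II matching the caption describes, assuming slots are compared after centering and that classification picks the class whose stored slot set is closest under a bidirectional Chamfer distance; the names and normalization choices here are assumptions, not the authors' implementation.

```python
# Gradient-free Phase II composition sketch: centered slot sets compared with a
# bidirectional (symmetric) Chamfer distance.
import torch

def bidirectional_chamfer(a, b):
    # a: (K_a, D), b: (K_b, D) slot sets for two scenes/classes.
    d = torch.cdist(a, b)                                     # (K_a, K_b) pairwise distances
    return d.min(dim=1).values.mean() + d.min(dim=0).values.mean()

@torch.no_grad()
def classify_by_composition(query_slots, class_slot_banks):
    # query_slots: (K, D); class_slot_banks: dict class_id -> (K_c, D) stored slots.
    q = query_slots - query_slots.mean(dim=0, keepdim=True)   # center the query slot set
    scores = {}
    for cls, bank in class_slot_banks.items():
        bank_c = bank - bank.mean(dim=0, keepdim=True)        # center the class slot set
        scores[cls] = bidirectional_chamfer(q, bank_c).item()
    return min(scores, key=scores.get)                        # lowest distance = predicted class
```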
Original abstract

Object-centric representations promise a key property for few-shot learning: Rather than treating a scene as a single unit, a model can decompose it into individual object-level parts that can be matched and compared across different concepts. In practice, this potential is rarely realized. Continual learners either collapse scenes into global embeddings, or train with part-level matching objectives that tie representations too closely to seen patterns, leaving them unable to generalize to truly novel concepts. In this paper, we identify this fundamental structural conflict and pioneer a new paradigm that strictly decouples representation learning from compositional inference. Leveraging the inherent patch-level semantic geometry of self-supervised Vision Transformers (ViTs), our framework employs a dual-phase strategy. During training, slot representations are optimized entirely toward holistic class identity, preserving highly generalizable, object-level geometries. At inference, preserved slots are dynamically composed to match novel scenes. We demonstrate that this paradigm offers dual structural benefits: The frozen backbone naturally prevents representation drift, while our lightweight, holistic optimization preserves the features' capacity for novel-concept transfer. Extensive experiments validate this approach, achieving state-of-the-art unseen-concept generalization and minimal forgetting across standard continual learning benchmarks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes a dual-phase framework for continual few-shot learning that decouples representation learning from compositional inference. It leverages the patch-level semantic geometry of self-supervised Vision Transformers (ViTs): during training, slot representations are optimized exclusively toward holistic class identity (preserving object-level geometries), while at inference the frozen slots are dynamically composed to match novel scenes. The approach claims dual benefits from a frozen backbone (preventing drift) and lightweight holistic optimization (enabling novel-concept transfer), yielding state-of-the-art unseen-concept generalization and minimal forgetting on standard continual-learning benchmarks.

Significance. If the invariance claim holds, the work would meaningfully advance compositional generalization in continual few-shot settings by resolving the tension between stability and adaptability. The strict decoupling of holistic training from dynamic inference, combined with the use of self-supervised ViT geometry, offers a clean architectural separation that could reduce catastrophic forgetting while supporting transfer to truly novel concepts—strengths that would be notable if supported by rigorous invariance measurements and ablations.

major comments (2)
  1. [Abstract] Abstract: The load-bearing claim that 'slot representations are optimized entirely toward holistic class identity, preserving highly generalizable, object-level geometries' receives no supporting measurement, ablation, or invariance analysis (e.g., no cosine similarity, patch-alignment scores, or before/after geometry metrics on seen vs. unseen classes). Without such evidence the premise that holistic optimization leaves ViT patch geometry intact for novel composition remains unverified and directly undermines the SOTA generalization assertions.
  2. [Abstract] Abstract and method description: The assertion that the frozen backbone 'naturally prevents representation drift' while still allowing 'dynamic composition' at inference is stated without detailing the composition mechanism, the slot-matching procedure, or any control experiments isolating the contribution of geometry preservation versus the frozen weights.
minor comments (1)
  1. [Abstract] Abstract: The statement 'achieving state-of-the-art unseen-concept generalization' is made without any numerical results, baseline comparisons, or dataset names, which is required for a claim of this strength even in an abstract.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point by point below and outline the specific revisions we will make to strengthen the manuscript.

Point-by-point responses
  1. Referee: [Abstract] Abstract: The load-bearing claim that 'slot representations are optimized entirely toward holistic class identity, preserving highly generalizable, object-level geometries' receives no supporting measurement, ablation, or invariance analysis (e.g., no cosine similarity, patch-alignment scores, or before/after geometry metrics on seen vs. unseen classes). Without such evidence the premise that holistic optimization leaves ViT patch geometry intact for novel composition remains unverified and directly undermines the SOTA generalization assertions.

    Authors: We agree that the abstract claim would be strengthened by explicit supporting measurements. In the revised manuscript we will add a dedicated invariance analysis subsection that reports cosine similarity between patch embeddings before and after holistic optimization, patch-alignment scores across seen and unseen classes, and before/after geometry metrics. These quantitative results will directly verify that the optimization preserves object-level geometries while still supporting the reported generalization performance. revision: yes

  2. Referee: [Abstract] Abstract and method description: The assertion that the frozen backbone 'naturally prevents representation drift' while still allowing 'dynamic composition' at inference is stated without detailing the composition mechanism, the slot-matching procedure, or any control experiments isolating the contribution of geometry preservation versus the frozen weights.

    Authors: We will expand the method section with a precise description of the slot-matching procedure (including the similarity metric and dynamic composition rule used at inference) and the exact mechanism by which the frozen backbone prevents drift. In addition, we will include new control experiments that compare the frozen-backbone setting against a fine-tuned backbone variant, thereby isolating the separate contributions of geometry preservation and drift prevention to the observed generalization and forgetting results. revision: yes
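
Both promised additions can be stated precisely. Below is a hedged sketch of each under assumed interfaces: a geometry-drift check that compares the pairwise similarity structure of patch tokens before and after the trained mapping (one way to operationalize the invariance analysis in point 1), and the optimizer toggle that separates the frozen-backbone and fine-tuned-backbone variants for the control in point 2. Nothing here is drawn from the paper's code.

```python
# Hypothetical experiment helpers; all names and interfaces are assumptions.
import torch
import torch.nn.functional as F

@torch.no_grad()
def geometry_drift(frozen_vit, trained_mapping, loader):
    # Compare the pairwise cosine-similarity structure of patch tokens before
    # and after the trained mapping; run separately on seen and unseen classes.
    drifts = []
    for images, _ in loader:
        p = F.normalize(frozen_vit(images), dim=-1)       # (B, N, D) raw patch tokens
        q = F.normalize(trained_mapping(p), dim=-1)       # (B, N, D') after optimization
        gram_p = p @ p.transpose(1, 2)                    # (B, N, N) similarity structure before
        gram_q = q @ q.transpose(1, 2)                    # (B, N, N) similarity structure after
        drifts.append((gram_p - gram_q).abs().mean())
    return torch.stack(drifts).mean().item()              # near 0 = geometry preserved

def build_optimizer(backbone, head, finetune_backbone=False,
                    head_lr=1e-3, backbone_lr=1e-5):
    # Frozen variant: only the head trains; fine-tuned variant: backbone joins in.
    for p in backbone.parameters():
        p.requires_grad_(finetune_backbone)
    param_groups = [{"params": head.parameters(), "lr": head_lr}]
    if finetune_backbone:
        param_groups.append({"params": backbone.parameters(), "lr": backbone_lr})
    return torch.optim.AdamW(param_groups)
```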

Circularity Check

0 steps flagged

No circularity detected; claims rest on methodological design and empirical results rather than self-referential definitions or fitted inputs.

full rationale

The paper's core argument is a proposed dual-phase framework: holistic class-identity optimization during training on seen data is asserted to preserve ViT patch geometries for later compositional inference on novel concepts. This is presented as an empirical design choice validated on continual learning benchmarks, with no equations, parameter fits, or self-citations shown that reduce the preservation claim to a tautology or input by construction. No load-bearing step equates the output (unseen generalization) to the training objective via redefinition or renaming. The derivation remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that self-supervised ViT patch geometries stay compositional and generalizable when trained only for holistic class identity.

axioms (1)
  • domain assumption: Self-supervised Vision Transformers possess inherent patch-level semantic geometry that remains generalizable to novel concepts when optimized holistically for class identity.
    Directly invoked to justify the training phase of the dual-phase strategy.

pith-pipeline@v0.9.0 · 5526 in / 1149 out tokens · 53051 ms · 2026-05-13T07:10:34.349602+00:00 · methodology

