
arxiv: 2606.22999 · v1 · pith:WILASUT6new · submitted 2026-06-22 · 💻 cs.CV

Black-Box Continual Learning for Vision-Language Models

Pith reviewed 2026-06-26 09:28 UTC · model grok-4.3

classification 💻 cs.CV
keywords black-box continual learning · vision-language models · textual prototypes · catastrophic forgetting · parameter-efficient tuning · continual learning benchmark · Semantic Projection Accumulation

The pith

Optimizing only textual prototypes enables black-box continual learning on VLMs to match white-box performance with 0.05M parameters.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes a Black-CL benchmark that restricts continual learning to output embeddings or logits only, with no gradients, weight access, or architecture changes. It proposes BETA as a method whose core mechanism is the adjustment of textual prototypes via three modules: Semantic Projection Accumulation to build knowledge over time, Latent Distribution Replay to stabilize the embedding space, and Test-Time Prototype Adaptation to refine decisions per instance. Experiments across ten datasets and multiple backbones show this approach reaches or exceeds white-box continual learning results while using 180 to 3000 times fewer trainable parameters.

Core claim

Solely optimizing textual prototypes can navigate the complexities of continual learning under black-box constraints, as BETA integrates SPA for incremental acquisition, LDR for anchoring against forgetting, and TTPA for instance-aware refinement to achieve performance on par with or exceeding white-box CL methods.

What carries the argument

Optimization of textual prototypes, carried out through Semantic Projection Accumulation for knowledge growth, Latent Distribution Replay for embedding stability, and Test-Time Prototype Adaptation for boundary refinement.
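To make the prototype-only idea concrete, here is a minimal sketch under stated assumptions. It is not the paper's SPA, LDR, or TTPA; it only illustrates how class prototypes can be trained when the learner sees nothing but output embeddings. The toy 2-D embeddings, the `scale` and `lr` values, and all function names are hypothetical, and re-normalizing after each step is a simplification chosen for this example.

```python
import math

def normalize(v):
    """Return v scaled to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def class_probs(image_emb, prototypes, scale=10.0):
    """Softmax over scaled dot products between one embedding and each prototype."""
    z = normalize(image_emb)
    logits = [scale * sum(a * b for a, b in zip(z, p)) for p in prototypes]
    top = max(logits)
    exps = [math.exp(l - top) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def train_step(prototypes, image_emb, label, lr=0.1, scale=10.0):
    """One cross-entropy gradient step on the prototypes only, then re-normalize."""
    z = normalize(image_emb)
    probs = class_probs(image_emb, prototypes, scale)
    updated = []
    for k, p in enumerate(prototypes):
        # d(loss)/d(p_k) = scale * (prob_k - 1[k == label]) * z
        coeff = scale * (probs[k] - (1.0 if k == label else 0.0))
        updated.append(normalize([pi - lr * coeff * zi for pi, zi in zip(p, z)]))
    return updated

def predict(image_emb, prototypes):
    """Index of the most probable class for one embedding."""
    probs = class_probs(image_emb, prototypes)
    return probs.index(max(probs))

# Toy stand-ins for embeddings returned by a black-box encoder (hypothetical values).
data = [([1.0, 0.0], 0), ([0.0, 1.0], 1)]
# Deliberately poor starting prototypes: each is closer to the wrong class.
prototypes = [normalize([0.6, 0.8]), normalize([0.8, 0.6])]

for _ in range(20):
    for emb, label in data:
        prototypes = train_step(prototypes, emb, label)

predictions = [predict(emb, prototypes) for emb, _ in data]
```

The encoder is never differentiated in this sketch: the only trainable quantities are the prototype vectors, so the gradient needs just the returned embedding and the label.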

If this is right

  • Continual learning becomes feasible for cloud-hosted VLMs where backpropagation through the backbone is impossible.
  • Parameter budgets for CL can drop by two to three orders of magnitude while preserving accuracy across diverse tasks.
  • Task-agnostic inference at test time remains viable under strict compute limits.
  • The same prototype-only strategy may apply directly to other output-only interfaces such as API-only language models.
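The "two to three orders of magnitude" figure can be checked against the numbers in the abstract. The short calculation below assumes the 180–3000× reduction is measured relative to BETA's 0.05M trainable parameters; the implied sizes of the compared methods are an inference from that assumption, not counts reported on this page.

```python
import math

beta_params = 50_000                        # 0.05M trainable parameters, per the abstract
reduction_low, reduction_high = 180, 3000   # reported reduction factors

# Parameter counts these factors would imply for the compared methods.
implied_low = beta_params * reduction_low
implied_high = beta_params * reduction_high

# The same factors expressed as orders of magnitude (base-10 logarithms).
orders_low = math.log10(reduction_low)
orders_high = math.log10(reduction_high)

print(f"implied range: {implied_low / 1e6:.0f}M to {implied_high / 1e6:.0f}M parameters")
print(f"orders of magnitude: {orders_low:.2f} to {orders_high:.2f}")
```

Under that assumption the compared methods would train roughly 9M to 150M parameters, and the reduction spans about 2.3 to 3.5 orders of magnitude.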

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The approach could extend to black-box settings in other modalities if output embeddings remain the only accessible signal.
  • Privacy-sensitive deployments gain a practical path to lifelong adaptation without exposing model weights.
  • Online deployment scenarios become more realistic because TTPA operates without retraining the entire system.

Load-bearing premise

That adjustments to textual prototypes alone can compensate for any distribution shifts or forgetting in the visual embedding space without access to model internals or gradients.

What would settle it

A controlled test on a new dataset where visual concept drift causes accuracy to fall below white-box baselines no matter how the textual prototypes are tuned, while keeping computation and access constraints fixed.

Figures

Figures reproduced from arXiv: 2606.22999 by Haoyuan Gao, Lichao Sun, Linghe Kong, Weihang Fang, Weiran Huang, Yexin Li, Yuting Li.

Figure 1: Comparison between traditional continual learning (CL) and the proposed Black-CL setting, with a white-box SOTA comparison. Conventional CL assumes white-box access to the backbone, allowing gradients to propagate through the model and permitting internal modifications such as adapters. Black-CL exposes only model outputs, such as embeddings or logits: the backbone is not merely frozen, but its parameters…
Figure 2: During training stage, BETA incrementally builds a textual prototype library by optimizing class-specific embeddings in the latent space. To mitigate…
Figure 3: Matrix visualization of the evaluation metrics. The horizontal axis denotes the test task, and the vertical axis denotes the stage after sequentially training each task in the continual-learning stream. Each rounded cell represents the accuracy R_{i,j} obtained after learning task i and testing on task j. The upper-triangular future-task cells are aggregated for Transfer Accuracy; the final-row cells are aver…
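The accuracy matrix described in the Figure 3 caption can be turned into summary metrics along the following lines. Because the caption is truncated here, the exact aggregation is an assumption: the sketch follows a common convention in which Transfer averages the cells above the diagonal column by column, Last averages the final row, and Average takes the mean over all cells. The matrix values and function names are hypothetical.

```python
def transfer_accuracy(R):
    """Mean over tasks j >= 2 of the accuracy on task j before it has been trained."""
    per_task = []
    for j in range(1, len(R)):
        before = [R[i][j] for i in range(j)]   # rows i < j: task j is still a future task
        per_task.append(sum(before) / len(before))
    return sum(per_task) / len(per_task)

def last_accuracy(R):
    """Mean of the final row: accuracy on every task after the whole stream."""
    return sum(R[-1]) / len(R[-1])

def average_accuracy(R):
    """Mean over all cells (assumed convention, not confirmed by the caption)."""
    n = len(R)
    return sum(sum(row) for row in R) / (n * n)

# R[i][j]: accuracy after training task i, tested on task j (toy values).
R = [
    [90.0, 60.0, 50.0],
    [85.0, 92.0, 55.0],
    [80.0, 88.0, 95.0],
]

transfer = transfer_accuracy(R)
last = last_accuracy(R)
average = average_accuracy(R)
```

With this layout, drops down a column below the diagonal indicate forgetting, while the cells above the diagonal show how well not-yet-trained tasks are handled.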
Figure 4: Compact analysis of TTPA retention and latency. (a) Transfer, Average, and Last Accuracy under different prototype-retention strategies on the full-shot Black-CL stream. Continuous adaptation achieves the best Transfer and Average Accuracy while matching the best Last Accuracy. (b) Inference latency on a logarithmic scale, with overhead measured relative to the zero-shot baseline. BETA remains much closer…
Figure 5: Visualization and quantitative validation of spherical GMM fitting on CLIP image features. Left: for each dataset, we select one representative class and fit a spherical GMM to its frozen CLIP image features in the original feature space. Points show the first two PCA dimensions, and colors denote component assignments. Ellipses visualize the empirical 2D spread of samples assigned to each spherical component…

Original abstract
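For readers unfamiliar with the term in the Figure 5 caption, a spherical GMM is a Gaussian mixture in which each component has a single variance shared across all dimensions. The sketch below fits one with plain expectation-maximization on toy 2-D points. It is illustrative only: the paper's fitting procedure, component count, and initialization are not given on this page, so the data, the `init_means` argument, and the iteration count are all assumptions.

```python
import math

def fit_spherical_gmm(points, init_means, n_iter=30, min_var=1e-6):
    """EM for a Gaussian mixture whose components each have covariance var * I."""
    n, dim, k = len(points), len(points[0]), len(init_means)
    means = [list(m) for m in init_means]
    variances = [1.0] * k
    weights = [1.0 / k] * k
    for _ in range(n_iter):
        # E-step: responsibilities from per-component log densities.
        resp = []
        for x in points:
            log_p = []
            for j in range(k):
                sq = sum((a - b) ** 2 for a, b in zip(x, means[j]))
                log_p.append(math.log(weights[j])
                             - 0.5 * dim * math.log(2.0 * math.pi * variances[j])
                             - 0.5 * sq / variances[j])
            top = max(log_p)
            exps = [math.exp(v - top) for v in log_p]
            total = sum(exps)
            resp.append([e / total for e in exps])
        # M-step: re-estimate weight, mean, and the single variance per component.
        for j in range(k):
            nj = max(sum(r[j] for r in resp), 1e-12)
            weights[j] = nj / n
            means[j] = [sum(r[j] * x[d] for r, x in zip(resp, points)) / nj
                        for d in range(dim)]
            sq_sum = sum(r[j] * sum((a - b) ** 2 for a, b in zip(x, means[j]))
                         for r, x in zip(resp, points))
            variances[j] = max(sq_sum / (nj * dim), min_var)
    return weights, means, variances

# Two well-separated toy clusters standing in for one class's image features.
points = [[0.0, 0.0], [0.2, 0.0], [0.0, 0.2], [0.2, 0.2],
          [5.0, 5.0], [5.2, 5.0], [5.0, 5.2], [5.2, 5.2]]
weights, means, variances = fit_spherical_gmm(points, init_means=[[1.0, 1.0], [4.0, 4.0]])
```

Each component is summarized by a weight, a mean vector, and one scalar variance, so the storage cost per component grows linearly with the feature dimension rather than quadratically as with a full covariance matrix.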

The rapid deployment of Vision-Language Models (VLMs) in dynamic environments necessitates the ability to learn continuously without forgetting. However, traditional continual learning (CL) settings often rely on white-box paradigms, which are increasingly invalidated by the shift toward cloud-hosted models. In this paper, we introduce Black-CL, a more realistic benchmark for VLMs that enforces three primary real-world challenges: weight and architecture inaccessibility, constrained computation, and task-agnostic inference. The learner can query only output embeddings or logits, with no gradient flow through or structural modification of the backbone. Current CL methodologies, which rely on backbone backpropagation or complex parameter expansion, are fundamentally incompatible with these constraints. Under this setting, we propose BETA, a simple yet effective baseline built on the key insight that solely optimizing textual prototypes can navigate the complexities of CL. BETA integrates three core components: Semantic Projection Accumulation (SPA) for incremental knowledge acquisition, Latent Distribution Replay (LDR) for anchoring the embedding space against catastrophic forgetting, and Test-Time Prototype Adaptation (TTPA) for dynamic, instance-aware boundary refinement. Extensive experiments across ten diverse datasets and various backbones demonstrate that BETA significantly outperforms existing black-box tuners. Remarkably, with only 0.05M trainable parameters, a 180–3000× reduction compared to competitive methods, BETA achieves performance on par with or even exceeding white-box CL methods. We believe Black-CL and BETA provide a foundational framework for future advancements in continual learning and accelerate the transition of continual learning from academia to real-world systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces the Black-CL benchmark for black-box continual learning of vision-language models, enforcing constraints of weight/architecture inaccessibility, constrained computation, and task-agnostic inference. It proposes BETA, which optimizes only textual prototypes via three components—Semantic Projection Accumulation (SPA), Latent Distribution Replay (LDR), and Test-Time Prototype Adaptation (TTPA)—and claims that with 0.05M trainable parameters (180–3000× reduction vs. competitors) BETA matches or exceeds white-box CL methods across ten datasets and multiple backbones.

Significance. If the performance claims hold under the stated black-box constraints, the work is significant for enabling continual learning on deployed cloud-hosted VLMs where white-box access is unavailable. The extreme parameter efficiency and the modeling choice of textual prototypes alone constitute a practical baseline that could accelerate real-world adoption; the benchmark itself also standardizes evaluation in this constrained regime.

major comments (2)
  1. [Abstract] Abstract: the central claim that BETA 'achieves performance on par with or even exceeding white-box CL methods' is load-bearing. The abstract provides no quantitative metrics, specific white-box baselines, or per-dataset breakdowns; without these in the results section the claim cannot be assessed for effect size or consistency.
  2. [Abstract] Abstract (key insight paragraph): the assertion that 'solely optimizing textual prototypes can navigate the complexities of CL' is the foundational modeling choice. The manuscript must include ablations that isolate the contribution of prototype optimization from SPA, LDR, and TTPA to confirm this is not an artifact of the auxiliary components.
minor comments (2)
  1. The abstract states experiments use 'various backbones' but does not enumerate them or report per-backbone results; this should be added to the experimental protocol for reproducibility.
  2. Notation for the parameter reduction ('180--3000$\times$') is clear in LaTeX but should be accompanied by an explicit table of trainable-parameter counts for each compared method.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and have revised the manuscript to incorporate the suggested improvements.

  1. Referee: [Abstract] Abstract: the central claim that BETA 'achieves performance on par with or even exceeding white-box CL methods' is load-bearing. The abstract provides no quantitative metrics, specific white-box baselines, or per-dataset breakdowns; without these in the results section the claim cannot be assessed for effect size or consistency.

    Authors: We agree that the abstract would benefit from greater specificity to support the central claim. In the revised manuscript, we have updated the abstract to include key quantitative metrics (e.g., average accuracy across the ten datasets and the reported 180–3000× parameter reduction) along with explicit references to the white-box baselines and per-dataset results already detailed in Section 4. The results section contains the full breakdowns, effect sizes, and consistency analysis across backbones; we have added cross-references from the abstract to these tables and figures for clarity. revision: yes

  2. Referee: [Abstract] Abstract (key insight paragraph): the assertion that 'solely optimizing textual prototypes can navigate the complexities of CL' is the foundational modeling choice. The manuscript must include ablations that isolate the contribution of prototype optimization from SPA, LDR, and TTPA to confirm this is not an artifact of the auxiliary components.

    Authors: We acknowledge the value of isolating the core modeling choice. The revised manuscript now includes dedicated ablation studies (new Table X and Figure Y in Section 4.3) that evaluate (i) a minimal variant performing only textual prototype optimization without SPA, LDR, or TTPA, (ii) incremental addition of each component, and (iii) full BETA. These results demonstrate that prototype optimization alone yields competitive performance under the black-box constraints, while the three components provide further gains, thereby substantiating the foundational insight. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical method with independent experimental validation


The paper introduces the Black-CL benchmark and BETA method as an empirical baseline for black-box continual learning, relying on three components (SPA, LDR, TTPA) that optimize textual prototypes. No derivation chain, first-principles predictions, or mathematical reductions are claimed; performance claims rest on experiments across ten datasets and multiple backbones rather than any closed-form equivalence to inputs. No self-citations, fitted parameters renamed as predictions, or ansatzes smuggled via prior work appear in the provided text. The central claim (0.05M parameters achieving white-box parity) is presented as an empirical result, not a definitional or self-referential construction, making the work self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on the empirical effectiveness of textual prototypes under black-box constraints; no explicit free parameters, axioms, or invented entities are stated in the abstract.

pith-pipeline@v0.9.1-grok · 5832 in / 1107 out tokens · 16146 ms · 2026-06-26T09:28:22.241429+00:00 · methodology


Reference graph

Works this paper leans on

68 extracted references · 11 linked inside Pith

  1. [1]

    Learning transferable visual models from natural language supervision,

    A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark et al., “Learning transferable visual models from natural language supervision,” in International Conference on Machine Learning. PMLR, 2021, pp. 8748–8763

  2. [2]

    Open-vocabulary semantic segmentation with mask-adapted clip,

    F. Liang, B. Wu, X. Dai, K. Li, Y . Zhao, H. Zhang, P. Zhang, P. Vajda, and D. Marculescu, “Open-vocabulary semantic segmentation with mask-adapted clip,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2023, pp. 7061–7070

  3. [3]

    Cora: Adapting clip for open-vocabulary detection with region prompting and anchor pre-matching,

    X. Wu, F. Zhu, R. Zhao, and H. Li, “Cora: Adapting clip for open-vocabulary detection with region prompting and anchor pre-matching,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2023, pp. 7031–7040

  4. [4]

    Convolutions die hard: Open-vocabulary segmentation with single frozen convolutional clip,

    Q. Yu, J. He, X. Deng, X. Shen, and L.-C. Chen, “Convolutions die hard: Open-vocabulary segmentation with single frozen convolutional clip,”Advances in Neural Information Processing Systems, vol. 36, pp. 32 215–32 234, 2023

  5. [5]

    Self-calibrated clip for training-free open-vocabulary segmentation,

    S. Bai, Y. Liu, Y. Han, H. Zhang, Y. Tang, J. Zhou, and J. Lu, “Self-calibrated clip for training-free open-vocabulary segmentation,” IEEE Transactions on Image Processing, 2025

  6. [6]

    icarl: Incremental classifier and representation learning,

    S.-A. Rebuffi, A. Kolesnikov, G. Sperl, and C. H. Lampert, “icarl: Incremental classifier and representation learning,” inProceedings of the IEEE conference on computer vision and pattern recognition (CVPR), 2017

  7. [7]

    Re-evaluating continual learning scenarios: A categorization and case for strong baselines,

    Y .-C. Hsu, Y .-C. Liu, A. Ramasamy, and Z. Kira, “Re-evaluating continual learning scenarios: A categorization and case for strong baselines,”arXiv preprint arXiv:1810.12488, 2018

  8. [8]

    Preventing zero-shot transfer degradation in continual learning of vision-language models,

    Z. Zheng, M. Ma, K. Wang, Z. Qin, X. Yue, and Y . You, “Preventing zero-shot transfer degradation in continual learning of vision-language models,” inProceedings of the IEEE/CVF international conference on computer vision, 2023, pp. 19 125–19 136

  9. [9]

    Boosting continual learning of vision-language models via mixture-of-experts adapters,

    J. Yu, Y . Zhuge, L. Zhang, P. Hu, D. Wang, H. Lu, and Y . He, “Boosting continual learning of vision-language models via mixture-of-experts adapters,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 23 219–23 230

  10. [10]

    Advancing cross-domain discriminability in continual learning of vision-language models,

    Y. Xu, Y. Chen, J. Nie, Y. Wang, H. Zhuang, and M. Okumura, “Advancing cross-domain discriminability in continual learning of vision-language models,” Advances in Neural Information Processing Systems, vol. 37, pp. 51552–51576, 2024

  11. [11]

    Lada: Scalable label-specific clip adapter for continual learning,

    M.-L. Luo, Z.-H. Zhou, T. Wei, and M.-L. Zhang, “Lada: Scalable label-specific clip adapter for continual learning,” 2025. [Online]. Available: https://arxiv.org/abs/2505.23271

  12. [12]

    Gpt-4 technical report,

    J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkatet al., “Gpt-4 technical report,”arXiv preprint arXiv:2303.08774, 2023

  13. [13]

    Gemini: a family of highly capable multimodal models,

    G. Team, R. Anil, S. Borgeaud, J.-B. Alayrac, J. Yu, R. Soricut, J. Schalkwyk, A. M. Dai, A. Hauth, K. Millicanet al., “Gemini: a family of highly capable multimodal models,”arXiv preprint arXiv:2312.11805, 2023

  14. [14]

    Black-box tuning for language-model-as-a-service,

    T. Sun, Y . Shao, H. Qian, X. Huang, and X. Qiu, “Black-box tuning for language-model-as-a-service,” inInternational Conference on Machine Learning. PMLR, 2022, pp. 20 841–20 855

  15. [15]

    Bbtv2: Towards a gradient-free future with large language models,

    T. Sun, Z. He, H. Qian, Y. Zhou, X.-J. Huang, and X. Qiu, “Bbtv2: Towards a gradient-free future with large language models,” in Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, 2022, pp. 3916–3930

  16. [16]

    Black-box prompt learning for pre-trained language models,

    S. Diao, Z. Huang, R. Xu, X. Li, Y. Lin, X. Zhou, and T. Zhang, “Black-box prompt learning for pre-trained language models,” arXiv preprint arXiv:2201.08531, 2022

  17. [17]

    Blackvip: Black-box visual prompting for robust transfer learning,

    C. Oh, H. Hwang, H.-y. Lee, Y . Lim, G. Jung, J. Jung, H. Choi, and K. Song, “Blackvip: Black-box visual prompting for robust transfer learning,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 24 224–24 235

  18. [18]

    Black box few-shot adaptation for vision-language models,

    Y . Ouali, A. Bulat, B. Matinez, and G. Tzimiropoulos, “Black box few-shot adaptation for vision-language models,” inProceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 15 534–15 546

  19. [19]

    Black-box tuning of vision-language models with effective gradient approximation,

    Z. Guo, Y. Wei, M. Liu, Z. Ji, J. Bai, Y. Guo, and W. Zuo, “Black-box tuning of vision-language models with effective gradient approximation,” arXiv preprint arXiv:2312.15901, 2023

  20. [20]

    Sigmoid loss for language image pre-training,

    X. Zhai, B. Mustafa, A. Kolesnikov, and L. Beyer, “Sigmoid loss for language image pre-training,” inProceedings of the IEEE/CVF international conference on computer vision, 2023, pp. 11 975–11 986

  21. [21]

    Siglip 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features,

    M. Tschannen, A. Gritsenko, X. Wang, M. F. Naeem, I. Alabdulmohsin, N. Parthasarathy, T. Evans, L. Beyer, Y. Xia, B. Mustafa et al., “Siglip 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features,” arXiv preprint arXiv:2502.14786, 2025

  22. [22]

    A continual learning survey: Defying forgetting in classification tasks,

    M. De Lange, R. Aljundi, M. Masana, S. Parisot, X. Jia, A. Leonardis, G. Slabaugh, and T. Tuytelaars, “A continual learning survey: Defying forgetting in classification tasks,”IEEE transactions on pattern analysis and machine intelligence (TPAMI), 2021

  23. [23]

    A comprehensive survey of continual learning: Theory, method and application,

    L. Wang, X. Zhang, H. Su, and J. Zhu, “A comprehensive survey of continual learning: Theory, method and application,”IEEE transactions on pattern analysis and machine intelligence, vol. 46, no. 8, pp. 5362– 5383, 2024

  24. [24]

    A practitioner’s guide to continual multimodal pretraining,

    K. Roth, V. Udandarao, S. Dziadzio, A. Prabhu, M. Cherti, O. Vinyals, O. Hénaff, S. Albanie, M. Bethge, and Z. Akata, “A practitioner’s guide to continual multimodal pretraining,” arXiv preprint arXiv:2408.14471, 2024

  25. [25]

    Dualprompt: Complementary prompting for rehearsal-free continual learning,

    Z. Wang, Z. Zhang, S. Ebrahimi, R. Sun, H. Zhang, C.-Y . Lee, X. Ren, G. Su, V . Perot, J. Dyet al., “Dualprompt: Complementary prompting for rehearsal-free continual learning,” inEuropean Conference on Computer Vision (ECCV), 2022

  26. [26]

    Coda-prompt: Continual decomposed attention-based prompting for rehearsal-free continual learning,

    J. S. Smith, L. Karlinsky, V. Gutta, P. Cascante-Bonilla, D. Kim, A. Arbelle, R. Panda, R. Feris, and Z. Kira, “Coda-prompt: Continual decomposed attention-based prompting for rehearsal-free continual learning,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR), 2023

  27. [27]

    When prompt-based incremental learning does not meet strong pretraining,

    Y.-M. Tang, Y.-X. Peng, and W.-S. Zheng, “When prompt-based incremental learning does not meet strong pretraining,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023

  28. [28]

    Evolving parameterized prompt memory for continual learning,

    M. R. Kurniawan, X. Song, Z. Ma, Y . He, Y . Gong, Y . Qi, and X. Wei, “Evolving parameterized prompt memory for continual learning,” in Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), 2024

  29. [29]

    Ider: Idempotent experience replay for reliable continual learning,

    Z. Liu, Y . Li, H. Gao, Y . Li, L. Kong, L. Sun, and W. Huang, “Ider: Idempotent experience replay for reliable continual learning,”arXiv preprint arXiv:2603.00624, 2026

  30. [30]

    Gradient episodic memory for continual learning,

    D. Lopez-Paz and M. Ranzato, “Gradient episodic memory for continual learning,”Advances in neural information processing systems, vol. 30, 2017

  31. [31]

    Dark experience for general continual learning: a strong, simple baseline,

    P. Buzzega, M. Boschini, A. Porrello, D. Abati, and S. Calderara, “Dark experience for general continual learning: a strong, simple baseline,” Advances in neural information processing systems, vol. 33, pp. 15 920– 15 930, 2020

  32. [32]

    Overcoming catastrophic forgetting in neural networks,

    J. Kirkpatrick, R. Pascanu, N. Rabinowitz, J. Veness, G. Desjardins, A. A. Rusu, K. Milan, J. Quan, T. Ramalho, A. Grabska-Barwinska et al., “Overcoming catastrophic forgetting in neural networks,” Proceedings of the national academy of sciences, vol. 114, no. 13, pp. 3521–3526, 2017

  33. [33]

    Continual learning through synaptic intelligence,

    F. Zenke, B. Poole, and S. Ganguli, “Continual learning through synaptic intelligence,” inInternational conference on machine learning (ICML), 2017

  34. [34]

    Memory aware synapses: Learning what (not) to forget,

    R. Aljundi, F. Babiloni, M. Elhoseiny, M. Rohrbach, and T. Tuytelaars, “Memory aware synapses: Learning what (not) to forget,” inProceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 139–154

  35. [35]

    Progressive neural networks,

    A. A. Rusu, N. C. Rabinowitz, G. Desjardins, H. Soyer, J. Kirkpatrick, K. Kavukcuoglu, R. Pascanu, and R. Hadsell, “Progressive neural networks,”arXiv preprint arXiv:1606.04671, 2016

  36. [36]

    Lifelong learning with dynamically expandable networks,

    J. Yoon, E. Yang, J. Lee, and S. J. Hwang, “Lifelong learning with dynamically expandable networks,” inInternational Conference on Learning Representations (ICLR), 2018

  37. [37]

    Packnet: Adding multiple tasks to a single network by iterative pruning,

    A. Mallya and S. Lazebnik, “Packnet: Adding multiple tasks to a single network by iterative pruning,” inProceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 7765– 7773

  38. [38]

    Select and distill: Selective dual-teacher knowledge transfer for continual learning on vision-language models,

    Y .-C. Yu, C.-P. Huang, J.-J. Chen, K.-P. Chang, Y .-H. Lai, F.-E. Yang, and Y .-C. F. Wang, “Select and distill: Selective dual-teacher knowledge transfer for continual learning on vision-language models,” inEuropean Conference on Computer Vision. Springer, 2024, pp. 219–236

  39. [39]

    Mind the interference: Retaining pre-trained knowledge in parameter efficient continual learning of vision-language models,

    L. Tang, Z. Tian, K. Li, C. He, H. Zhou, H. Zhao, X. Li, and J. Jia, “Mind the interference: Retaining pre-trained knowledge in parameter efficient continual learning of vision-language models,” inEuropean conference on computer vision. Springer, 2024, pp. 346–365

  40. [40]

    Prior convictions: Black-box adversarial attacks with bandits and priors,

    A. Ilyas, L. Engstrom, and A. Madry, “Prior convictions: Black-box adversarial attacks with bandits and priors,” arXiv preprint arXiv:1807.07978, 2018

  41. [41]

    Black-box adversarial attacks with limited queries and information,

    A. Ilyas, L. Engstrom, A. Athalye, and J. Lin, “Black-box adversarial attacks with limited queries and information,” in International Conference on Machine Learning. PMLR, 2018, pp. 2137–2146.

    SUBMITTED TO IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE

  42. [42]

    Black-box adversarial attack with transferable model-based embedding,

    Z. Huang and T. Zhang, “Black-box adversarial attack with transferable model-based embedding,”arXiv preprint arXiv:1911.07140, 2019

  43. [43]

    Square attack: A query-efficient black-box adversarial attack via random search,

    M. Andriushchenko, F. Croce, N. Flammarion, and M. Hein, “Square attack: A query-efficient black-box adversarial attack via random search,” in European Conference on Computer Vision. Springer, 2020, pp. 484–501

  44. [44]

    Improving black-box adversarial attacks with a transfer-based prior,

    S. Cheng, Y. Dong, T. Pang, H. Su, and J. Zhu, “Improving black-box adversarial attacks with a transfer-based prior,” Advances in Neural Information Processing Systems, vol. 32, 2019

  45. [45]

    Natural evolution strategies,

    D. Wierstra, T. Schaul, T. Glasmachers, Y. Sun, J. Peters, and J. Schmidhuber, “Natural evolution strategies,” The Journal of Machine Learning Research, vol. 15, no. 1, pp. 949–980, 2014

  46. [46]

    Policy gradient methods for reinforcement learning with function approximation,

    R. S. Sutton, D. McAllester, S. Singh, and Y . Mansour, “Policy gradient methods for reinforcement learning with function approximation,” in Advances in Neural Information Processing Systems, vol. 12, 1999

  47. [47]

    Black-box forgetting,

    Y. Kuwana, Y. Goto, T. Shibata, and G. Irie, “Black-box forgetting,” Advances in Neural Information Processing Systems, vol. 37, pp. 58792–58815, 2024

  48. [48]

    Learning multiple layers of features from tiny images,

    A. Krizhevsky, G. Hintonet al., “Learning multiple layers of features from tiny images,” University of Toronto, Toronto, ON, Canada, Tech. Rep., 2009

  49. [49]

    Conditional prompt learning for vision-language models,

    K. Zhou, J. Yang, C. C. Loy, and Z. Liu, “Conditional prompt learning for vision-language models,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 16816–16825

  50. [50]

    Learning generative visual models from few training examples: An incremental bayesian approach tested on 101 object categories,

    L. Fei-Fei, R. Fergus, and P. Perona, “Learning generative visual models from few training examples: An incremental bayesian approach tested on 101 object categories,” in2004 conference on computer vision and pattern recognition workshop. IEEE, 2004, pp. 178–178

  51. [51]

    Cats and dogs,

    O. M. Parkhi, A. Vedaldi, A. Zisserman, and C. Jawahar, “Cats and dogs,” in 2012 IEEE conference on computer vision and pattern recognition. IEEE, 2012, pp. 3498–3505

  52. [52]

    3d object representations for fine-grained categorization,

    J. Krause, M. Stark, J. Deng, and L. Fei-Fei, “3d object representations for fine-grained categorization,” in Proceedings of the IEEE international conference on computer vision workshops, 2013, pp. 554–561

  53. [53]

    Automated flower classification over a large number of classes,

    M.-E. Nilsback and A. Zisserman, “Automated flower classification over a large number of classes,” in2008 Sixth Indian conference on computer vision, graphics & image processing. IEEE, 2008, pp. 722–729

  54. [54]

    Food-101–mining discriminative components with random forests,

    L. Bossard, M. Guillaumin, and L. Van Gool, “Food-101–mining discriminative components with random forests,” in European conference on computer vision. Springer, 2014, pp. 446–461

  55. [55]

    Fine-grained visual classification of aircraft,

    S. Maji, E. Rahtu, J. Kannala, M. Blaschko, and A. Vedaldi, “Fine-grained visual classification of aircraft,” arXiv preprint arXiv:1306.5151, 2013

  56. [56]

    Sun database: Large-scale scene recognition from abbey to zoo,

    J. Xiao, J. Hays, K. A. Ehinger, A. Oliva, and A. Torralba, “Sun database: Large-scale scene recognition from abbey to zoo,” in2010 IEEE computer society conference on computer vision and pattern recognition. IEEE, 2010, pp. 3485–3492

  57. [57]

    Describing textures in the wild,

    M. Cimpoi, S. Maji, I. Kokkinos, S. Mohamed, and A. Vedaldi, “Describing textures in the wild,” inProceedings of the IEEE conference on computer vision and pattern recognition, 2014, pp. 3606–3613

  58. [58]

    Eurosat: A novel dataset and deep learning benchmark for land use and land cover classification,

    P. Helber, B. Bischke, A. Dengel, and D. Borth, “Eurosat: A novel dataset and deep learning benchmark for land use and land cover classification,” IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 12, no. 7, pp. 2217–2226, 2019

  59. [59]

    Ucf101: A dataset of 101 human actions classes from videos in the wild,

    K. Soomro, A. R. Zamir, and M. Shah, “Ucf101: A dataset of 101 human actions classes from videos in the wild,”arXiv preprint arXiv:1212.0402, 2012

  60. [60]

    The mnist database of handwritten digit images for machine learning research [best of the web],

    L. Deng, “The mnist database of handwritten digit images for machine learning research [best of the web],”IEEE signal processing magazine, vol. 29, no. 6, pp. 141–142, 2012

  61. [61]

    Learning without forgetting,

    Z. Li and D. Hoiem, “Learning without forgetting,” 2017. [Online]. Available: https://arxiv.org/abs/1606.09282

  62. [62]

    Robust fine-tuning of zero-shot models,

    M. Wortsman, G. Ilharco, J. W. Kim, M. Li, S. Kornblith, R. Roelofs, R. G. Lopes, H. Hajishirzi, A. Farhadi, H. Namkoonget al., “Robust fine-tuning of zero-shot models,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 7959– 7971

  63. [63]

    InfoNCE induces gaussian distribution,

    R. Betser, E. Gofer, M. Y . Levi, and G. Gilboa, “InfoNCE induces gaussian distribution,”arXiv preprint arXiv:2602.24012, 2026

  64. [64]

    LeJEPA: Provable and scalable self-supervised learning without the heuristics,

    R. Balestriero and Y . LeCun, “LeJEPA: Provable and scalable self-supervised learning without the heuristics,”arXiv preprint arXiv:2511.08544, 2025

  65. [65]

    Tent: Fully test-time adaptation by entropy minimization,

    D. Wang, E. Shelhamer, S. Liu, B. Olshausen, and T. Darrell, “Tent: Fully test-time adaptation by entropy minimization,”arXiv preprint arXiv:2006.10726, 2020

  66. [66]

    Continual test-time domain adaptation,

    Q. Wang, O. Fink, L. Van Gool, and D. Dai, “Continual test-time domain adaptation,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 7201–7211

  67. [67]

    Model-order selection: A review of information criterion rules,

    P. Stoica and Y . Selen, “Model-order selection: A review of information criterion rules,”IEEE Signal Processing Magazine, vol. 21, no. 4, pp. 36–47, 2004

  68. [68]

    Regularizing CNN transfer learning with randomised regression,

    Y. Zhong and A. Maki, “Regularizing CNN transfer learning with randomised regression,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 13637–13646.

    Yuting Li is currently a Ph.D. student in the School of Computer Science, Shanghai Jiao Tong University, Shanghai, China. His research focuses on continual lear...