
arXiv: 2605.29380 · v1 · pith:TKGBXTNRnew · submitted 2026-05-28 · cs.LG · cs.AI · cs.CV

TRACER: Persistent Regularization for Robust Multimodal Finetuning

Pith reviewed 2026-06-29 08:48 UTC · model grok-4.3

classification cs.LG · cs.AI · cs.CV
keywords multimodal finetuning · contrastive learning · out-of-distribution robustness · catastrophic forgetting · self-distillation · weighted moving average · regularization · CLIP

The pith

A weighted moving average teacher prevents collapse and enables persistent regularization in multimodal contrastive finetuning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a theoretical framework for multimodal contrastive finetuning that yields closed-form solutions and a geometric decomposition for each regularization strategy. It finds that standard exponential moving average (EMA) teachers collapse onto the student, while a weighted moving average (WMA) teacher maintains a persistent regularizing force over finite horizons and converges without bias in the task subspace while preserving orthogonal knowledge. This motivates TRACER, which combines contrastive learning with WMA-guided multi-perspective distillation. If the analysis holds, models like CLIP could be finetuned with better out-of-distribution (OOD) accuracy and calibration, without catastrophic forgetting of pretrained knowledge. A sympathetic reader would care because maintaining robustness during adaptation is a key challenge in deploying pretrained multimodal models.

Core claim

TRACER combines contrastive learning with WMA-guided multi-perspective distillation to achieve consistent OOD accuracy and calibration gains. The theoretical framework shows that self-distillation retains pretrained knowledge more effectively than other regularization approaches, that EMA teachers collapse, and that a WMA teacher maintains a persistent regularizing force and bias-free convergence in the task subspace while preserving orthogonal knowledge. Experiments on CLIP finetuning across three backbones confirm the gains and the robustness to hyperparameters.
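As a rough illustration of how a contrastive objective and a teacher-guided distillation term can be combined, here is a minimal NumPy sketch. It is not the authors' implementation: the function names, the temperature, the trade-off weight `lam`, and the choice to distill both the image-to-text and text-to-image similarity distributions (one plausible reading of "multi-perspective") are all assumptions made for illustration.

```python
import numpy as np

def log_softmax(logits, axis=-1):
    # Numerically stable log-softmax.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def similarity_logits(img, txt, temperature=0.07):
    # Cosine-similarity logits between L2-normalised image and text embeddings.
    img = img / np.linalg.norm(img, axis=1, keepdims=True)
    txt = txt / np.linalg.norm(txt, axis=1, keepdims=True)
    return img @ txt.T / temperature

def contrastive_loss(logits):
    # Symmetric InfoNCE: matched image-text pairs sit on the diagonal.
    i2t = -np.diag(log_softmax(logits, axis=1)).mean()
    t2i = -np.diag(log_softmax(logits, axis=0)).mean()
    return 0.5 * (i2t + t2i)

def distillation_loss(student_logits, teacher_logits):
    # KL(teacher || student), averaged over both retrieval directions.
    # Treating the two directions as the "perspectives" is an assumption.
    total = 0.0
    for axis in (1, 0):
        log_p_t = log_softmax(teacher_logits, axis=axis)
        log_p_s = log_softmax(student_logits, axis=axis)
        kl = (np.exp(log_p_t) * (log_p_t - log_p_s)).sum(axis=axis)
        total += kl.mean()
    return 0.5 * total

def combined_loss(student_logits, teacher_logits, lam=1.0):
    # lam is a hypothetical trade-off weight, not a value from the paper.
    return contrastive_loss(student_logits) + lam * distillation_loss(
        student_logits, teacher_logits
    )

rng = np.random.default_rng(0)
img_s, txt_s = rng.normal(size=(8, 16)), rng.normal(size=(8, 16))
img_t, txt_t = rng.normal(size=(8, 16)), rng.normal(size=(8, 16))
student = similarity_logits(img_s, txt_s)
teacher = similarity_logits(img_t, txt_t)
loss = combined_loss(student, teacher, lam=0.5)
```

When the teacher and student produce identical logits, the distillation term vanishes and only the contrastive term remains, which is the situation the collapse argument is concerned with.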

What carries the argument

The WMA teacher that maintains a persistent regularizing force over finite horizons, yields bias-free convergence in the task subspace, and preserves orthogonal knowledge.
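The contrast between a collapsing teacher and a persistent one can be seen in a scalar toy model. This summary does not state the paper's WMA weighting scheme, so the sketch below substitutes a uniform running average over all student iterates as a stand-in. The geometric student dynamics and the values of `rho`, `beta`, and `steps` are likewise assumptions chosen only to make the effect visible.

```python
def teacher_gaps(rho=0.9, beta=0.9, steps=200):
    """Return (ema_gap, avg_gap): |teacher - student| after `steps` updates."""
    target, student = 0.0, 1.0   # scalar "weights": pretrained point 1, task optimum 0
    ema_teacher = student        # both teachers start at the pretrained point
    avg_teacher = student
    for t in range(1, steps + 1):
        # Student contracts geometrically toward the task optimum (toy dynamics).
        student = target + rho * (student - target)
        # EMA teacher: exponentially forgets the early trajectory.
        ema_teacher = beta * ema_teacher + (1.0 - beta) * student
        # Running-average teacher (stand-in for WMA): uniform weight on every
        # iterate so far, including the pretrained starting point.
        avg_teacher += (student - avg_teacher) / (t + 1)
    return abs(ema_teacher - student), abs(avg_teacher - student)

ema_gap, avg_gap = teacher_gaps()
print(f"EMA gap: {ema_gap:.2e}, running-average gap: {avg_gap:.2e}")
```

Under these toy assumptions the EMA gap shrinks geometrically, while the running-average gap shrinks only roughly like 1/t, so a nonzero regularizing signal remains over any finite horizon.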

If this is right

  • Self-distillation outperforms other regularization approaches in retaining pretrained model knowledge.
  • TRACER yields consistent OOD accuracy and calibration improvements across multiple backbone architectures.
  • The method is robust to hyperparameter choices.
  • Geometric decomposition provides closed-form solutions for analyzing each strategy.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar persistent regularization techniques could apply to other types of model adaptation beyond multimodal contrastive finetuning.
  • The analysis of teacher collapse might inform designs in knowledge distillation for other domains like natural language processing.
  • Testing the framework on additional datasets or tasks could reveal further benefits or limitations.

Load-bearing premise

The geometric decomposition and closed-form solutions accurately capture the dynamics of regularization strategies including EMA collapse and WMA persistent force.

What would settle it

Experiments showing that an EMA teacher does not collapse, or that a WMA teacher fails to provide persistent regularization over finite horizons, would falsify the theoretical claims.
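A check of this kind needs a measurable notion of collapse. The sketch below computes three batch-averaged diagnostics of the sort plotted for the teachers: the teacher-student KL divergence, the teacher entropy, and the teacher confidence. The KL direction (teacher to student), the function name, and the random logits are assumptions for illustration only.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def teacher_diagnostics(teacher_logits, student_logits, eps=1e-12):
    """Batch-averaged KL(teacher || student), teacher entropy, teacher confidence."""
    p_t = softmax(teacher_logits)
    p_s = softmax(student_logits)
    kl = (p_t * (np.log(p_t + eps) - np.log(p_s + eps))).sum(axis=-1).mean()
    entropy = -(p_t * np.log(p_t + eps)).sum(axis=-1).mean()
    confidence = p_t.max(axis=-1).mean()
    return {"kl": float(kl), "entropy": float(entropy), "confidence": float(confidence)}

rng = np.random.default_rng(0)
student_logits = rng.normal(size=(32, 10))
# A collapsed teacher is indistinguishable from the student, so the KL gap is zero.
collapsed = teacher_diagnostics(student_logits, student_logits)
# A teacher that differs from the student yields a strictly positive KL gap.
distinct = teacher_diagnostics(
    0.5 * student_logits + rng.normal(size=(32, 10)), student_logits
)
```

Logging these quantities at every training step for both teacher types would show whether the KL gap vanishes (collapse) or stays bounded away from zero.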

Figures

Figures reproduced from arXiv: 2605.29380 by Christopher Leckie, Feng Liu, Hesam Asadollahzadeh, Sarah M. Erfani.

Figure 1: Overview of TRACER. The base contrastive objective is combined with a dynamic self-distillation loss from a Weighted Moving Average (WMA) teacher to preserve orthogonal pretrained knowledge while adaptively mixing within the task subspace. $\theta^{0}_{\mathrm{CLIP}}$ represents the initial pretrained CLIP model. $\theta^{t}$ denotes the student model at time $t$, with its image and text encoders ($E_{\mathrm{Image}}$ and $E_{\mathrm{Text}}$) being trained. The stu…

Figure 2: Geometric interpretation of finetuning strategies in 2D weight space. The green line represents $\mathrm{span}(X_I^\top)$, the subspace where finetuning data concentrates. Starting from pretrained weights $W_I^0$ (blue), each method combines the orthogonal component $W_I^0 (I - P_I)$ and the new task solution $W^{\star}_{\mathrm{FT}} = Y_{\mathrm{FT}} X_I^\top (X_I X_I^\top)^{+}$ (green) differently: (a) Direct FT preserves the orthogonal component and replaces the para…

Figure 3: Toy Experiment. We compare a pretrained model against four finetuning methods on a finetuning task. (a) Performance on the original MNIST and new Colored MNIST task. All finetuning methods successfully learn the new task. Direct FT and L2 Reg suffer severe performance degradation (catastrophic forgetting). (b) Catastrophic forgetting rate, quantified as the percentage drop in accuracy on the original task.…

Figure 4: Layer-wise Representational Similarity. We compare the internal representations of the Pretrained model against Direct FT and TRACER using CKA (left) and SVCCA (right) across all layers of the CLIP ViT-B/16 image encoder. TRACER (gold) preserves the geometric structure of the pretrained knowledge significantly better than Direct FT (pink), particularly in deeper layers.

Figure 5: Teacher–Student Knowledge Gap During Training. Compared to the EMA teacher (blue), which shows rapidly vanishing KL divergence and thus a weakening regularization signal (left), the WMA teacher (orange) sustains a higher and more stable KL gap. This stability is supported by higher teacher entropy (middle) and moderated confidence (right), preventing overfitting. Together, these trends confirm that WMA pro…

Figure 6: Comparison of Teacher Dynamics (Update Frequency = 1). We track the evolution of the teacher model for CaRot (EMA) and TRACER (WMA) when updated at every step. The EMA teacher (blue) rapidly collapses onto the student (KL → 0), losing its regularizing capability. The WMA teacher (orange) maintains a persistent, stable gap, providing continuous regularization without needing brittle update schedules.
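The Figure 4 caption compares layer representations with CKA. For readers unfamiliar with the metric, here is a minimal sketch of the linear variant of CKA. Whether the paper uses the linear or a kernel variant is not stated in this summary, so that choice, the function name, and the synthetic activations are assumptions.

```python
import numpy as np

def linear_cka(x, y):
    """Linear CKA between activation matrices of shape (n_samples, n_features)."""
    x = x - x.mean(axis=0, keepdims=True)   # centre each feature
    y = y - y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(y.T @ x, "fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, "fro")
    norm_y = np.linalg.norm(y.T @ y, "fro")
    return cross / (norm_x * norm_y)

rng = np.random.default_rng(0)
pretrained = rng.normal(size=(64, 32))
# An orthogonal transform plus isotropic scaling keeps the geometry intact.
rotation, _ = np.linalg.qr(rng.normal(size=(32, 32)))
same_geometry = 3.0 * pretrained @ rotation
# Independent activations share no structure with the pretrained ones.
unrelated = rng.normal(size=(64, 32))

print(linear_cka(pretrained, same_geometry), linear_cka(pretrained, unrelated))
```

Linear CKA is invariant to orthogonal transformations and isotropic scaling of the features, which is why it is suited to asking whether finetuning preserved the geometric structure of a layer rather than its exact coordinates.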
Original abstract

Mainstream strategies for finetuning pretrained multimodal models often degrade out-of-distribution (OOD) robustness, a phenomenon known as catastrophic forgetting. In this paper, we develop a theoretical framework for multimodal contrastive finetuning, yielding closed-form solutions and a geometric decomposition for each strategy. This framework shows that self-distillation is more effective than other regularization approaches to retain the knowledge of the pretrained model. Our analysis reveals a largely overlooked limitation: standard Exponential Moving Average (EMA) teachers, widely used in robust finetuning, suffer from collapse. To solve this, we prove that a Weighted Moving Average (WMA) teacher maintains a persistent regularizing force over finite horizons and yields bias-free convergence in the task subspace while preserving orthogonal knowledge. These insights motivate **TRACER** (**T**rajectory-**R**obust **A**nchoring for **C**ontrastive **E**ncoder **R**egularization), which combines contrastive learning with WMA-guided multi-perspective distillation. Extensive experiments on CLIP finetuning demonstrate consistent OOD accuracy and calibration gains across three backbone architectures, and comprehensive ablations confirm that TRACER is both principled and robust to hyperparameter choices. Code is available at [https://github.com/HesamAsad/TRACER](https://github.com/HesamAsad/TRACER).

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper develops a theoretical framework for multimodal contrastive finetuning of pretrained models (e.g., CLIP), providing closed-form solutions and a geometric decomposition of regularization strategies. It identifies collapse in standard EMA teachers and proves that a Weighted Moving Average (WMA) teacher maintains persistent orthogonal regularization over finite horizons with bias-free convergence in the task subspace. These insights motivate TRACER, which augments contrastive learning with WMA-guided multi-perspective self-distillation. Experiments across three backbone architectures report consistent gains in OOD accuracy and calibration, with ablations supporting robustness to hyperparameters; code is released.

Significance. If the closed-form derivations and geometric claims hold under realistic finetuning conditions, the work supplies a principled explanation for why self-distillation outperforms other regularizers and directly motivates a practical replacement for EMA. This could influence robust finetuning practices for vision-language models where OOD reliability matters. Reproducibility via the linked GitHub repository is a clear strength.

major comments (2)
  1. [theoretical framework] Theoretical framework (closed-form solutions and geometric decomposition): the derivations establishing EMA collapse and WMA's persistent orthogonal force appear to rely on assumptions such as linearity of the encoder dynamics, infinite data, or exact task-subspace orthogonality. These may not hold for CLIP's non-linear encoders, batch noise, or finite-horizon training; if violated, the claimed superiority of WMA over EMA does not necessarily follow even if empirical gains are observed.
  2. [theoretical framework and § on TRACER] Motivation for TRACER (WMA-guided multi-perspective distillation): the central claim that this combination yields consistent OOD gains rests on the geometric decomposition accurately capturing each regularization strategy. Without explicit verification that the derived bias-free convergence and persistent force survive the non-linear, stochastic setting of actual CLIP finetuning, the link between theory and the reported experimental improvements remains unestablished.
minor comments (1)
  1. [abstract] The abstract and introduction would benefit from a brief statement of the precise assumptions under which the closed-form solutions are derived (e.g., linearity, data regime).

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful comments on the theoretical framework and its connection to TRACER. We address each point below, clarifying the role of the assumptions and the intended link to experiments.

Point-by-point responses
  1. Referee: Theoretical framework (closed-form solutions and geometric decomposition): the derivations establishing EMA collapse and WMA's persistent orthogonal force appear to rely on assumptions such as linearity of the encoder dynamics, infinite data, or exact task-subspace orthogonality. These may not hold for CLIP's non-linear encoders, batch noise, or finite-horizon training; if violated, the claimed superiority of WMA over EMA does not necessarily follow even if empirical gains are observed.

    Authors: The closed-form solutions and geometric decomposition are derived under linear dynamics and infinite-data assumptions to obtain exact characterizations of regularization behavior; this is a standard approach for analytical insight. We agree these assumptions are idealized relative to non-linear CLIP encoders and stochastic batch training. The framework is presented as providing intuition and motivation rather than a complete non-linear proof. The empirical results on real CLIP models are offered as supporting evidence that the predicted advantages of WMA materialize in practice. We will add an explicit limitations paragraph discussing the scope of the assumptions. revision: partial

  2. Referee: Motivation for TRACER (WMA-guided multi-perspective distillation): the central claim that this combination yields consistent OOD gains rests on the geometric decomposition accurately capturing each regularization strategy. Without explicit verification that the derived bias-free convergence and persistent force survive the non-linear, stochastic setting of actual CLIP finetuning, the link between theory and the reported experimental improvements remains unestablished.

    Authors: The theory supplies a principled motivation for replacing EMA with WMA and for the multi-perspective distillation design; the experiments (including cross-architecture ablations) then test whether these design choices deliver the expected OOD and calibration benefits. We do not claim a rigorous non-linear stochastic proof of the derived quantities. The consistent gains and hyperparameter robustness are presented as empirical corroboration of the framework's practical utility. We will revise the motivation section to state more clearly that the geometric analysis provides intuition while the experiments serve as the primary validation. revision: partial

Circularity Check

0 steps flagged

No circularity detected; theoretical claims presented as independent derivations

full rationale

The provided abstract and context describe a theoretical framework yielding closed-form solutions and geometric decompositions for regularization strategies in multimodal contrastive finetuning, including analysis of EMA collapse and WMA persistence. No equations, self-citations, fitted parameters renamed as predictions, or self-referential definitions are exhibited in the text. The derivation chain is presented as first-principles analysis without reduction to inputs by construction, and the central empirical claims rest on experiments rather than tautological predictions. This is the expected honest non-finding when no load-bearing circular steps can be quoted.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no explicit details on free parameters, axioms, or invented entities; full manuscript unavailable for audit.




Appendix excerpts

… modifies the contrastive loss by using a sigmoid function instead of softmax, and FLIP (Li et al., 2023c) integrates masking strategies to accelerate training.

A.2. Theory of Contrastive Learning

A rich theoretical literature analyzes contrastive learning from first principles, characterizing when and why contrastive objectives recover useful features and...

Null Space Component: The null space of $Q$ consists of matrices $A$ such that $Q(A) = A C_I = 0$. This holds if and only if the rows of $A$ are in the null space of $C_I$. The orthogonal projector onto this component of the initial matrix $W_I^0$ is $\Pi_{Q^{\perp}}(W_I^0) = W_I^0 (I - P_I)$, where $P_I = C_I C_I^{+}$ is the projector onto the row space of $X_I$. This component is preserved …

Range Component: The pseudoinverse $Q^{+}$ finds the minimum Frobenius norm solution to $Q(W) = P$ that lies in $\mathrm{Range}(Q)$. This is the solution to $W C_I = Y_{\mathrm{FT}} X_I^\top$, which is $Q^{+}(P) = (Y_{\mathrm{FT}} X_I^\top)\, C_I^{+}$. Combining the components gives the final solution:

$$W_{\mathrm{FT}} = W_I^0 (I - P_I) + Y_{\mathrm{FT}} X_I^\top (X_I X_I^\top)^{+}.$$

L2 Regularization. The objective $L(W_I) = \tfrac{1}{2} \lVert W_I X_I - Y_{\mathrm{FT}} \rVert_F^2 + \tfrac{\lambda}{2} \lVert W_I - W_I^0 \rVert^2$ ...
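The direct finetuning closed form above can be checked numerically. The sketch below builds a small random problem with fewer samples than input dimensions, so that a non-trivial null space exists, and verifies two properties: the solution fits the finetuning data, and the component of the pretrained weights orthogonal to the data is left untouched. The definition $C_I = X_I X_I^\top$ is inferred from the final formula rather than stated in the excerpt, and the dimensions and random data are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, n = 6, 3, 4                 # n < d_in, so the data leave a null space
X = rng.normal(size=(d_in, n))           # finetuning inputs, one column per sample
Y = rng.normal(size=(d_out, n))          # finetuning targets
W0 = rng.normal(size=(d_out, d_in))      # pretrained weights

C = X @ X.T                              # assumed definition C_I = X_I X_I^T
C_pinv = np.linalg.pinv(C, rcond=1e-10)  # explicit cutoff: C is rank-deficient
P = C @ C_pinv                           # projector P_I = C_I C_I^+
I = np.eye(d_in)

# Closed form: preserved null-space part plus minimum-norm range part.
W_ft = W0 @ (I - P) + Y @ X.T @ C_pinv

fit_residual = np.linalg.norm(W_ft @ X - Y)
preserved_gap = np.linalg.norm(W_ft @ (I - P) - W0 @ (I - P))
print(fit_residual, preserved_gap)
```

Both quantities should be zero up to floating-point error: the first because the samples are linearly independent here, the second because the range part has no component outside the data subspace.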

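Range Component: The pseudoinverse is $Q_{\mathrm{SD}}^{+} = \frac{1}{1+\lambda} Q^{+}$, where $Q^{+}$ corresponds to the direct finetuning case. Applying it to $P_{\mathrm{SD}}$:

$$
\begin{aligned}
Q_{\mathrm{SD}}^{+}(P_{\mathrm{SD}})
&= \frac{1}{1+\lambda}\, Q^{+}\!\left(Y_{\mathrm{FT}} X_I^\top + \lambda W_I^0 C_I\right) \\
&= \frac{1}{1+\lambda} \left[ (Y_{\mathrm{FT}} X_I^\top)\, C_I^{+} + \lambda\, Q^{+}\!\left(Q(W_I^0)\right) \right] \\
&= \frac{1}{1+\lambda} \left[ (Y_{\mathrm{FT}} X_I^\top)\, C_I^{+} + \lambda\, \Pi_{Q}(W_I^0) \right] \\
&= \frac{1}{1+\lambda} \left[ Y_{\mathrm{FT}} X_I^\top (X_I X_I^\top)^{+} + \lambda W_I^0 P_I \right].
\end{aligned}
$$

Com...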
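The self-distillation range component can be checked the same way. The excerpt is cut off before the components are combined, so the sketch assumes the null-space part $W_I^0 (I - P_I)$ is preserved exactly as in direct finetuning. It also assumes $C_I = X_I X_I^\top$ and reads $Q_{\mathrm{SD}} = (1+\lambda) Q$ off the stated pseudoinverse. The value of `lam` and the random data are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
d_in, d_out, n, lam = 6, 3, 4, 2.0       # lam is an arbitrary illustrative value
X = rng.normal(size=(d_in, n))
Y = rng.normal(size=(d_out, n))
W0 = rng.normal(size=(d_out, d_in))

C = X @ X.T                              # assumed C_I = X_I X_I^T
C_pinv = np.linalg.pinv(C, rcond=1e-10)  # explicit cutoff: C is rank-deficient
P = C @ C_pinv

# Range component as given in the derivation.
range_part = (Y @ X.T @ C_pinv + lam * W0 @ P) / (1.0 + lam)
# Assumption: the null-space part W0 (I - P) is preserved, as in direct finetuning.
W_sd = W0 @ (np.eye(d_in) - P) + range_part

# Q_SD(W) = (1 + lam) * W C should equal P_SD = Y X^T + lam * W0 C.
normal_eq_gap = np.linalg.norm((1.0 + lam) * W_sd @ C - (Y @ X.T + lam * W0 @ C))
# On the data, predictions blend the targets with the pretrained outputs.
blend_gap = np.linalg.norm(W_sd @ X - (Y + lam * W0 @ X) / (1.0 + lam))
print(normal_eq_gap, blend_gap)
```

With linearly independent samples, the second check shows what the regularizer does inside the task subspace: the finetuned predictions are a convex combination of the new targets and the pretrained model's outputs, weighted by `1 / (1 + lam)` and `lam / (1 + lam)`.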