TRACER: Persistent Regularization for Robust Multimodal Finetuning

Christopher Leckie; Feng Liu; Hesam Asadollahzadeh; Sarah M. Erfani

arxiv: 2605.29380 · v1 · pith:TKGBXTNRnew · submitted 2026-05-28 · 💻 cs.LG · cs.AI · cs.CV

Hesam Asadollahzadeh, Feng Liu, Christopher Leckie, Sarah M. Erfani

Pith reviewed 2026-06-29 08:48 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CV

keywords multimodal finetuning · contrastive learning · out-of-distribution robustness · catastrophic forgetting · self-distillation · weighted moving average · regularization · CLIP

The pith

A weighted moving average teacher prevents collapse and enables persistent regularization in multimodal contrastive finetuning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a theoretical framework for multimodal contrastive finetuning that provides closed-form solutions and geometric decomposition for regularization strategies. It reveals that standard EMA teachers suffer from collapse, while a WMA teacher maintains a persistent regularizing force over finite horizons and yields bias-free convergence in the task subspace while preserving orthogonal knowledge. This motivates the TRACER method, which combines contrastive learning with WMA-guided multi-perspective distillation. If true, this would allow finetuning of models like CLIP with better out-of-distribution accuracy and calibration without catastrophic forgetting of pretrained knowledge. A sympathetic reader would care because maintaining robustness during adaptation is a key challenge in deploying pretrained multimodal models.

Core claim

TRACER combines contrastive learning with WMA-guided multi-perspective distillation to achieve consistent OOD accuracy and calibration gains. The theoretical framework shows that self-distillation retains pretrained knowledge more effectively than other regularization approaches, that EMA teachers collapse, and that a WMA teacher maintains a persistent regularizing force and bias-free convergence in the task subspace while preserving orthogonal knowledge. Experiments on CLIP finetuning across three backbones confirm the gains and the robustness to hyperparameters.

What carries the argument

The WMA teacher that maintains a persistent regularizing force over finite horizons, yields bias-free convergence in the task subspace, and preserves orthogonal knowledge.

If this is right

Self-distillation outperforms other regularization approaches in retaining pretrained model knowledge.
TRACER yields consistent OOD accuracy and calibration improvements across multiple backbone architectures.
The method is robust to hyperparameter choices.
Geometric decomposition provides closed-form solutions for analyzing each strategy.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar persistent regularization techniques could apply to other types of model adaptation beyond multimodal contrastive finetuning.
The analysis of teacher collapse might inform designs in knowledge distillation for other domains like natural language processing.
Testing the framework on additional datasets or tasks could reveal further benefits or limitations.

Load-bearing premise

The geometric decomposition and closed-form solutions accurately capture the dynamics of regularization strategies including EMA collapse and WMA persistent force.

What would settle it

Observing that an EMA teacher does not collapse or that a WMA teacher fails to provide persistent regularization over finite horizons in experiments would falsify the theoretical claims.
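One way to probe this test before running CLIP experiments is a toy simulation. The sketch below is not the paper's code. It assumes a scalar student that moves geometrically from the pretrained value 0 to the task optimum 1, a standard EMA update with a hypothetical decay `beta`, and a WMA teacher taken to be a normalized weighted average of the whole trajectory with uniform weights (this summary does not state the paper's actual weighting). The function name `teacher_gaps` and all constants are illustrative.

```python
import numpy as np


def teacher_gaps(num_steps=2000, beta=0.99, rate=0.9):
    """Return (ema_gap, wma_gap): |student - teacher| at every step of a toy run."""
    steps = np.arange(num_steps + 1)
    # Toy student: a scalar weight moving from the pretrained value 0 to the optimum 1.
    student = 1.0 - rate ** steps

    # EMA teacher: teacher <- beta * teacher + (1 - beta) * student.
    ema = np.empty_like(student)
    ema[0] = student[0]
    for t in range(1, num_steps + 1):
        ema[t] = beta * ema[t - 1] + (1.0 - beta) * student[t]

    # WMA teacher (ASSUMED form): normalized weighted average of the trajectory so far.
    # Uniform weights are a placeholder; the paper's weighting may differ.
    weights = np.ones_like(student)
    wma = np.cumsum(weights * student) / np.cumsum(weights)

    return np.abs(student - ema), np.abs(student - wma)


ema_gap, wma_gap = teacher_gaps()
print(f"final EMA gap: {ema_gap[-1]:.2e}, final WMA gap: {wma_gap[-1]:.2e}")
```

In this toy the EMA gap shrinks geometrically, while the averaged teacher's gap shrinks only on the order of 1/t, so it stays non-negligible over a finite horizon. Whether the same holds for real encoders is what the experiments would need to show.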

Figures

Figures reproduced from arXiv: 2605.29380 by Christopher Leckie, Feng Liu, Hesam Asadollahzadeh, Sarah M. Erfani.

**Figure 1.** Overview of TRACER. The base contrastive objective is combined with a dynamic self-distillation loss from a Weighted Moving Average (WMA) teacher to preserve orthogonal pretrained knowledge while adaptively mixing within the task subspace. $\theta^{0}_{\mathrm{CLIP}}$ represents the initial pretrained CLIP model. $\theta_t$ denotes the student model at time $t$, with its image and text encoders ($E_{\mathrm{Image}}$ and $E_{\mathrm{Text}}$) being trained. The stu…

**Figure 2.** Geometric interpretation of finetuning strategies in 2D weight space. The green line represents $\mathrm{span}(X_I^\top)$, the subspace where finetuning data concentrates. Starting from pretrained weights $W_I^0$ (blue), each method combines the orthogonal component $W_I^0 (I - P_I)$ and the new task solution $W^\star_{FT} = Y_{FT} X_I^\top (X_I X_I^\top)^+$ (green) differently: (a) Direct FT preserves the orthogonal component and replaces the para…

**Figure 3.** Toy Experiment. We compare a pretrained model against four finetuning methods on a finetuning task. (a) Performance on the original MNIST and new Colored MNIST task. All finetuning methods successfully learn the new task. Direct FT and L2 Reg suffer severe performance degradation (catastrophic forgetting). (b) Catastrophic forgetting rate, quantified as the percentage drop in accuracy on the original task. …

**Figure 4.** Layer-wise Representational Similarity. We compare the internal representations of the Pretrained model against Direct FT and TRACER using CKA (left) and SVCCA (right) across all layers of the CLIP ViT-B/16 image encoder. TRACER (gold) preserves the geometric structure of the pretrained knowledge significantly better than Direct FT (pink), particularly in deeper layers.

**Figure 5.** Teacher–Student Knowledge Gap During Training. Compared to the EMA teacher (blue), which shows rapidly vanishing KL divergence and thus a weakening regularization signal (left), the WMA teacher (orange) sustains a higher and more stable KL gap. This stability is supported by higher teacher entropy (middle) and moderated confidence (right), preventing overfitting. Together, these trends confirm that WMA pro…

**Figure 6.** Comparison of Teacher Dynamics (Update Frequency = 1). We track the evolution of the teacher model for CaRot (EMA) and TRACER (WMA) when updated at every step. The EMA teacher (blue) rapidly collapses onto the student (KL → 0), losing its regularizing capability. The WMA teacher (orange) maintains a persistent, stable gap, providing continuous regularization without needing brittle update schedules.
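The decomposition in the Figure 2 caption can be checked on a small linear model. The sketch below assumes that direct finetuning minimizes the least-squares objective 0.5 * ||W X - Y||_F^2 by plain gradient descent started at the pretrained weights, with W of shape (outputs, inputs) and X of shape (inputs, samples). The problem sizes, seed, and function names are hypothetical, and the linear setting is a stand-in for the encoders, not a model of CLIP.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical toy problem: 6 input dims, 3 finetuning samples, 2 output dims.
basis, _ = np.linalg.qr(rng.standard_normal((6, 3)))  # orthonormal columns
X = basis * np.array([1.0, 1.5, 2.0])                 # singular values 1, 1.5, 2
Y = rng.standard_normal((2, 3))                       # finetuning targets
W0 = rng.standard_normal((2, 6))                      # "pretrained" weights


def direct_ft_closed_form(X, Y, W0):
    """W0 (I - P) + Y X^T (X X^T)^+, with P the projector onto the task subspace."""
    X_pinv = np.linalg.pinv(X)   # equals X^T (X X^T)^+
    P = X @ X_pinv               # equals (X X^T)(X X^T)^+
    I = np.eye(X.shape[0])
    return W0 @ (I - P) + Y @ X_pinv, P


def direct_ft_gradient_descent(X, Y, W0, steps=300):
    """Gradient descent on 0.5 * ||W X - Y||_F^2, initialized at W0."""
    lr = 1.0 / np.linalg.norm(X, 2) ** 2  # 1 / largest eigenvalue of X X^T
    W = W0.copy()
    for _ in range(steps):
        W -= lr * (W @ X - Y) @ X.T
    return W


W_closed, P = direct_ft_closed_form(X, Y, W0)
W_gd = direct_ft_gradient_descent(X, Y, W0)
print("max |GD - closed form|:", np.abs(W_gd - W_closed).max())
print("max orthogonal drift:", np.abs((W_gd - W0) @ (np.eye(6) - P)).max())
```

Under these assumptions gradient descent never leaves the task subspace, so the component of the pretrained weights orthogonal to the data is carried over unchanged, which is the behavior panel (a) describes for Direct FT.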

Abstract

Mainstream strategies for finetuning pretrained multimodal models often degrade out-of-distribution (OOD) robustness, a phenomenon known as catastrophic forgetting. In this paper, we develop a theoretical framework for multimodal contrastive finetuning, yielding closed-form solutions and a geometric decomposition for each strategy. This framework shows that self-distillation is more effective than other regularization approaches to retain the knowledge of the pretrained model. Our analysis reveals a largely overlooked limitation: standard Exponential Moving Average (EMA) teachers, widely used in robust finetuning, suffer from collapse. To solve this, we prove that a Weighted Moving Average (WMA) teacher maintains a persistent regularizing force over finite horizons and yields bias-free convergence in the task subspace while preserving orthogonal knowledge. These insights motivate **TRACER** (**T**rajectory-**R**obust **A**nchoring for **C**ontrastive **E**ncoder **R**egularization), which combines contrastive learning with WMA-guided multi-perspective distillation. Extensive experiments on CLIP finetuning demonstrate consistent OOD accuracy and calibration gains across three backbone architectures, and comprehensive ablations confirm that TRACER is both principled and robust to hyperparameter choices. Code is available at [https://github.com/HesamAsad/TRACER](https://github.com/HesamAsad/TRACER).
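To make the shape of such an objective concrete, here is a minimal NumPy sketch of a contrastive loss combined with a teacher-to-student KL term. It is not TRACER's actual loss: the abstract does not define "multi-perspective distillation", so the sketch simply assumes a symmetric InfoNCE term plus KL distillation over the image-to-text and text-to-image similarity distributions. The function names, the temperature, and the weight `lam` are hypothetical; the repository linked in the abstract is the authoritative source.

```python
import numpy as np


def log_softmax(z, axis):
    z = z - z.max(axis=axis, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=axis, keepdims=True))


def similarity_logits(img, txt, temperature=0.07):
    """Cosine-similarity logits between a batch of image and text embeddings."""
    img = img / np.linalg.norm(img, axis=1, keepdims=True)
    txt = txt / np.linalg.norm(txt, axis=1, keepdims=True)
    return img @ txt.T / temperature


def contrastive_loss(logits):
    """Symmetric InfoNCE; matching image-text pairs sit on the diagonal."""
    idx = np.arange(logits.shape[0])
    image_to_text = -log_softmax(logits, axis=1)[idx, idx].mean()
    text_to_image = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (image_to_text + text_to_image)


def distillation_loss(student_logits, teacher_logits):
    """Mean KL(teacher || student) over both retrieval directions (ASSUMED form)."""
    total = 0.0
    for axis in (0, 1):
        log_p_teacher = log_softmax(teacher_logits, axis)
        log_p_student = log_softmax(student_logits, axis)
        kl = (np.exp(log_p_teacher) * (log_p_teacher - log_p_student)).sum(axis=axis)
        total += kl.mean()
    return 0.5 * total


def combined_loss(student_logits, teacher_logits, lam=1.0):
    return contrastive_loss(student_logits) + lam * distillation_loss(
        student_logits, teacher_logits
    )


rng = np.random.default_rng(0)
student_logits = similarity_logits(rng.standard_normal((4, 8)), rng.standard_normal((4, 8)))
teacher_logits = similarity_logits(rng.standard_normal((4, 8)), rng.standard_normal((4, 8)))
loss = combined_loss(student_logits, teacher_logits, lam=1.0)
print("combined loss:", loss)
```

The distillation term is zero when the teacher and student produce the same similarity distributions, which is why a teacher that collapses onto the student stops regularizing.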

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

TRACER swaps EMA for WMA to avoid teacher collapse in multimodal finetuning and supplies a geometric analysis plus experiments on CLIP, but the theory's assumptions look like the main risk.

TRACER replaces standard EMA teachers with a weighted moving average version inside a distillation setup for contrastive finetuning of models like CLIP. The central claim is that this change, plus multi-perspective distillation, keeps out-of-distribution accuracy and calibration from degrading.

The paper's main contribution is the theoretical framework that decomposes regularization strategies into closed-form solutions and a geometric view of the task subspace versus orthogonal directions. It shows why self-distillation outperforms other regularizers and identifies EMA collapse as a concrete failure mode, then proves WMA avoids that collapse while staying bias-free over finite horizons. Experiments run across three backbone architectures with ablations on hyperparameter sensitivity, and code is released.

Those elements are solid. The analysis gives a concrete reason to prefer one teacher update rule over another, and the experimental section checks robustness rather than just reporting a single best run.

The soft spot sits in the theory. The closed-form solutions and geometric decomposition rely on assumptions about orthogonality and dynamics that may not survive the non-linear encoders, batch noise, and finite training used in actual CLIP finetuning. If those assumptions are too strong, the motivation for WMA rests more on the empirical gains than on the derivations. The reported OOD improvements are described as consistent, but without effect sizes or head-to-head numbers against recent alternatives it is hard to judge how large the practical advance is.

This work is aimed at people who finetune multimodal models and care about distribution shift. A reader who wants to see distillation analyzed geometrically in this setting will get value from the framework and the released code. The combination of theory, multiple backbones, and ablations is enough to send it to peer review rather than desk reject.

Referee Report

2 major / 1 minor

Summary. The paper develops a theoretical framework for multimodal contrastive finetuning of pretrained models (e.g., CLIP), providing closed-form solutions and a geometric decomposition of regularization strategies. It identifies collapse in standard EMA teachers and proves that a Weighted Moving Average (WMA) teacher maintains persistent orthogonal regularization over finite horizons with bias-free convergence in the task subspace. These insights motivate TRACER, which augments contrastive learning with WMA-guided multi-perspective self-distillation. Experiments across three backbone architectures report consistent gains in OOD accuracy and calibration, with ablations supporting robustness to hyperparameters; code is released.

Significance. If the closed-form derivations and geometric claims hold under realistic finetuning conditions, the work supplies a principled explanation for why self-distillation outperforms other regularizers and directly motivates a practical replacement for EMA. This could influence robust finetuning practices for vision-language models where OOD reliability matters. Reproducibility via the linked GitHub repository is a clear strength.

major comments (2)

[theoretical framework] Theoretical framework (closed-form solutions and geometric decomposition): the derivations establishing EMA collapse and WMA's persistent orthogonal force appear to rely on assumptions such as linearity of the encoder dynamics, infinite data, or exact task-subspace orthogonality. These may not hold for CLIP's non-linear encoders, batch noise, or finite-horizon training; if violated, the claimed superiority of WMA over EMA does not necessarily follow even if empirical gains are observed.
[theoretical framework and § on TRACER] Motivation for TRACER (WMA-guided multi-perspective distillation): the central claim that this combination yields consistent OOD gains rests on the geometric decomposition accurately capturing each regularization strategy. Without explicit verification that the derived bias-free convergence and persistent force survive the non-linear, stochastic setting of actual CLIP finetuning, the link between theory and the reported experimental improvements remains unestablished.

minor comments (1)

[abstract] The abstract and introduction would benefit from a brief statement of the precise assumptions under which the closed-form solutions are derived (e.g., linearity, data regime).

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful comments on the theoretical framework and its connection to TRACER. We address each point below, clarifying the role of the assumptions and the intended link to experiments.

Referee: Theoretical framework (closed-form solutions and geometric decomposition): the derivations establishing EMA collapse and WMA's persistent orthogonal force appear to rely on assumptions such as linearity of the encoder dynamics, infinite data, or exact task-subspace orthogonality. These may not hold for CLIP's non-linear encoders, batch noise, or finite-horizon training; if violated, the claimed superiority of WMA over EMA does not necessarily follow even if empirical gains are observed.

Authors: The closed-form solutions and geometric decomposition are derived under linear dynamics and infinite-data assumptions to obtain exact characterizations of regularization behavior; this is a standard approach for analytical insight. We agree these assumptions are idealized relative to non-linear CLIP encoders and stochastic batch training. The framework is presented as providing intuition and motivation rather than a complete non-linear proof. The empirical results on real CLIP models are offered as supporting evidence that the predicted advantages of WMA materialize in practice. We will add an explicit limitations paragraph discussing the scope of the assumptions.

revision: partial

Referee: Motivation for TRACER (WMA-guided multi-perspective distillation): the central claim that this combination yields consistent OOD gains rests on the geometric decomposition accurately capturing each regularization strategy. Without explicit verification that the derived bias-free convergence and persistent force survive the non-linear, stochastic setting of actual CLIP finetuning, the link between theory and the reported experimental improvements remains unestablished.

Authors: The theory supplies a principled motivation for replacing EMA with WMA and for the multi-perspective distillation design; the experiments (including cross-architecture ablations) then test whether these design choices deliver the expected OOD and calibration benefits. We do not claim a rigorous non-linear stochastic proof of the derived quantities. The consistent gains and hyperparameter robustness are presented as empirical corroboration of the framework's practical utility. We will revise the motivation section to state more clearly that the geometric analysis provides intuition while the experiments serve as the primary validation.

revision: partial

Circularity Check

0 steps flagged

No circularity detected; theoretical claims presented as independent derivations

The provided abstract and context describe a theoretical framework yielding closed-form solutions and geometric decompositions for regularization strategies in multimodal contrastive finetuning, including analysis of EMA collapse and WMA persistence. No equations, self-citations, fitted parameters renamed as predictions, or self-referential definitions are exhibited in the text. The derivation chain is presented as first-principles analysis without reduction to inputs by construction, and the central empirical claims rest on experiments rather than tautological predictions. This is the expected honest non-finding when no load-bearing circular steps can be quoted.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no explicit details on free parameters, axioms, or invented entities; full manuscript unavailable for audit.

Appendix excerpts

... modifies the contrastive loss by using a sigmoid function instead of softmax, and FLIP (Li et al., 2023c) integrates masking strategies to accelerate training.

A.2. Theory of Contrastive Learning

A rich theoretical literature analyzes contrastive learning from first principles, characterizing when and why contrastive objectives recover useful features and...

Null Space Component: The null space of $Q$ consists of matrices $A$ such that $Q(A) = A C_I = 0$. This holds if and only if the rows of $A$ are in the null space of $C_I$. The orthogonal projector onto this component of the initial matrix $W_I^0$ is $\Pi_{Q^\perp}(W_I^0) = W_I^0 (I - P_I)$, where $P_I = C_I C_I^+$ is the projector onto the row space of $X_I$. This component is preserved

Range Component: The pseudoinverse $Q^+$ finds the minimum Frobenius norm solution to $Q(W) = P$ that lies in $\mathrm{Range}(Q)$. This is the solution to $W C_I = Y_{FT} X_I^\top$, which is $Q^+(P) = (Y_{FT} X_I^\top) C_I^+$. Combining the components gives the final solution:

$$W_{FT} = W_I^0 (I - P_I) + Y_{FT} X_I^\top (X_I X_I^\top)^+.$$

L2 Regularization. The objective $L(W_I) = \frac{1}{2}\|W_I X_I - Y_{FT}\|_F^2 + \frac{\lambda}{2}\|W_I - W_I^0\|^2$ ...

Range Component: The pseudoinverse is $Q_{SD}^+ = \frac{1}{1+\lambda} Q^+$, where $Q^+$ corresponds to the direct finetuning case. Applying it to $P_{SD}$:

$$
\begin{aligned}
Q_{SD}^+(P_{SD}) &= \frac{1}{1+\lambda}\, Q^+\!\left(Y_{FT} X_I^\top + \lambda W_I^0 C_I\right) \\
&= \frac{1}{1+\lambda}\left[(Y_{FT} X_I^\top) C_I^+ + \lambda\, Q^+(Q(W_I^0))\right] \\
&= \frac{1}{1+\lambda}\left[(Y_{FT} X_I^\top) C_I^+ + \lambda\, \Pi_Q(W_I^0)\right] \\
&= \frac{1}{1+\lambda}\left[Y_{FT} X_I^\top (X_I X_I^\top)^+ + \lambda W_I^0 P_I\right].
\end{aligned}
$$

Com...
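The self-distillation solution excerpted above can be sanity-checked numerically. The sketch below combines the preserved null-space component $W_I^0 (I - P_I)$ with the range component from the last derivation. It assumes the stationarity condition $(1+\lambda)\, W C_I = Y_{FT} X_I^\top + \lambda W_I^0 C_I$ with $C_I = X_I X_I^\top$, which is read off from $Q_{SD}^+ = \frac{1}{1+\lambda} Q^+$ and $P_{SD}$ rather than stated in the excerpt. The function name, problem sizes, and seed are hypothetical.

```python
import numpy as np


def self_distillation_solution(X, Y, W0, lam):
    """W0 (I - P) + [Y X^T (X X^T)^+ + lam * W0 P] / (1 + lam)."""
    X_pinv = np.linalg.pinv(X)   # equals X^T (X X^T)^+
    P = X @ X_pinv               # projector onto the task subspace, equals C C^+
    I = np.eye(X.shape[0])
    W_task = Y @ X_pinv          # Y X^T (X X^T)^+
    return W0 @ (I - P) + (W_task + lam * (W0 @ P)) / (1.0 + lam), P


# Hypothetical toy sizes: 6 input dims, 3 finetuning samples, 2 output dims.
rng = np.random.default_rng(0)
X = rng.standard_normal((6, 3))
Y = rng.standard_normal((2, 3))
W0 = rng.standard_normal((2, 6))
lam = 0.5

W_sd, P = self_distillation_solution(X, Y, W0, lam)
C = X @ X.T

# Residual of the ASSUMED stationarity condition (1 + lam) W C = Y X^T + lam W0 C.
residual = (1.0 + lam) * W_sd @ C - (Y @ X.T + lam * W0 @ C)
# Change in the component orthogonal to the task subspace.
orthogonal_drift = (W_sd - W0) @ (np.eye(6) - P)
print("max residual:", np.abs(residual).max())
print("max orthogonal drift:", np.abs(orthogonal_drift).max())
```

Setting `lam=0` recovers the direct finetuning solution $W_{FT}$, and a very large `lam` returns weights close to $W_I^0$, so in this linear setting $\lambda$ interpolates inside the task subspace while the orthogonal component is left alone.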

[1] [1]

Zavrtanik, M

IEEE, 2021. doi: 10.1109/ICCV48922.2021.00951. URL https://doi.org/10.1109/ICCV48922. 2021.00951. Deng, J., Dong, W., Socher, R., Li, L., Li, K., and Fei-Fei, L. Imagenet: A large-scale hierarchical image database. In 2009 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2009), 20-25 June 2009, Miami, Florida, USA, pp. 248...

work page doi:10.1109/iccv48922.2021.00951 2021

[2] [2]

Fang, A., Ilharco, G., Wortsman, M., Wan, Y ., Shankar, V ., Dave, A., and Schmidt, L

URL https://openreview.net/forum? id=YicbFdNTTy. Fang, A., Ilharco, G., Wortsman, M., Wan, Y ., Shankar, V ., Dave, A., and Schmidt, L. Data determines distri- butional robustness in contrastive language image pre- training (CLIP). InInternational Conference on Ma- chine Learning, ICML 2022, 17-23 July 2022, Balti- more, Maryland, USA, volume 162 ofProcee...

2022

[3] [3]

Black, and Otmar Hilliges

URL https://proceedings.mlr.press/ v162/fang22a.html. 10 TRACER: Persistent Regularization for Robust Multimodal Finetuning Fang, A., Jose, A. M., Jain, A., Schmidt, L., Toshev, A. T., and Shankar, V . Data filtering networks. InThe Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenRe- view.net, ...

work page doi:10.1109/cvpr52729.2023.01855 2024

[4] [4]

Garrido, Q., Chen, Y ., Bardes, A., Najman, L., and Le- Cun, Y

URL http://proceedings.mlr.press/ v80/furlanello18a.html. Garrido, Q., Chen, Y ., Bardes, A., Najman, L., and Le- Cun, Y . On the duality between contrastive and non- contrastive self-supervised learning. InThe Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenRe- view.net, 2023. URL https://openr...

work page doi:10.1109/cvpr52729 2023

[5] [5]

HaoChen, J

URL https://openreview.net/forum? id=AuEgNlEAmed. HaoChen, J. Z., Wei, C., Kumar, A., and Ma, T. Beyond separability: Analyzing the linear transferability of con- trastive representations to related subpopulations. InAd- vances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New O...

work page doi:10.1109/cvpr42600 2022

[6] [6]

URL http://proceedings.mlr.press/ v97/houlsby19a.html. Hu, E. J., Shen, Y ., Wallis, P., Allen-Zhu, Z., Li, Y ., Wang, S., Wang, L., and Chen, W. Lora: Low-rank adapta- tion of large language models. InThe Tenth Interna- tional Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. OpenReview.net,

2022

[7] [7]

URL https://openreview.net/forum?id=nZeVKeeFYf9.
Ilharco, G., Wortsman, M., Wightman, R., Gordon, C., Carlini, N., Taori, R., Dave, A., Shankar, V., Namkoong, H., Miller, J., Hajishirzi, H., Farhadi, A., and Schmidt, L. Openclip, July 2021. URL https://doi.org/10.5281/zenodo.5143773. If you use this software, please cite it as below.
Izmailov, P., Po...

2021

[8] [8]

URL https://proceedings.neurips.cc/paper_files/paper/2018/file/5a4be1fa34e62bb8a6ec6b91d2462f5a-Paper.pdf.
Jang, D., Yun, S., and Han, D. Model stock: All we need is just a few fine-tuned models. In Computer Vision - ECCV 2024 - 18th European Conference, Milan, Italy, September 29-October 4, 2024, Proceedings, Part XLIV, volume 15102 of Lecture Not...

doi:10.1007/978-3-031-72784-9 2018

[9] [9]

URL http://proceedings.mlr.press/v139/jia21b.html.
Jia, M., Tang, L., Chen, B., Cardie, C., Belongie, S. J., Hariharan, B., and Lim, S. Visual prompt tuning. In Avidan, S., Brostow, G. J., Cissé, M., Farinella, G. M., and Hassner, T. (eds.), Computer Vision - ECCV 2022 - 17th European Conference, Tel Aviv, Israel, October 23-27, 2022, Proceedings, Part XX...

doi:10.1007/978-3-031-19827-4 2022

[10] [10]

URL http://proceedings.mlr.press/v97/kornblith19a.html.
Kumar, A., Raghunathan, A., Jones, R. M., Ma, T., and Liang, P. Fine-tuning can distort pretrained features and underperform out-of-distribution. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. OpenReview.net,

2022

[11] [11]

Lecun, L

URL https://openreview.net/forum?id=UYneFzXSJWh.
Laine, S. and Aila, T. Temporal ensembling for semi-supervised learning. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. OpenReview.net, 2017. URL https://openreview.net/forum?id=BJ6oOfqge.
LeCun, Y., Bottou, L., B...

doi:10.1109/5.726791 2017

[12] [12]

In: IEEE/CVF International Conference on Computer Vision

PMLR, 2018. URL http://proceedings.mlr.press/v80/li18a.html.
Li, X., Fang, Y., Liu, M., Ling, Z., Tu, Z., and Su, H. Distilling large vision-language model with out-of-distribution generalizability. In IEEE/CVF International Conference on Computer Vision, ICCV 2023, Paris, France, October 1-6, 2023, pp. 2492–2503. IEEE, 2023a. doi: 10.1109/ICCV51070....

doi:10.1109/iccv51070.2023.00236 2018

[13] [13]


URL https://doi.org/10.18653/v1/2021.acl-long.353.
Li, Y., Fan, H., Hu, R., Feichtenhofer, C., and He, K. Scaling language-image pre-training via masking. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2023, Vancouver, BC, Canada, June 17-24, 2023, pp. 23390–23400. IEEE, 2023c. doi: 10.1109/CVPR52729.2023.02240. URL https:// d...

doi:10.18653/v1/ 2021

[14] [14]

and Hoiem, D

doi: 10.1109/TPAMI.2017.2773081. URL https://doi.org/10.1109/TPAMI.2017.2773081.
Li, Z., Li, X., Fu, X., Zhang, X., Wang, W., Chen, S., and Yang, J. Promptkd: Unsupervised prompt distillation for vision-language models. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024, Seattle, WA, USA, June 16-22, 2024, pp. 26607–26616. IEEE, ...

doi:10.1109/tpami.2017.2773081 2017

[15] [15]

PMLR, 2023. URL https://proceedings.mlr.press/v206/nakada23a.html.
Nam, G., Heo, B., and Lee, J. Lipsum-ft: Robust fine-tuning of zero-shot models using random text guidance. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024. URL https://openreview.net/forum?id=2JF8mJRJ7...

2023

[16] [16]

URL https://openreview.net/forum?id=M0MF4t3hE9.
Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., and Sutskever, I. Learning transferable visual models from natural language supervision. In Proceedings of the 38th International Conference on Machine Learning, ICML 202...

doi:10.1109/cvpr.2017.587 2021

[17] [17]

Progressive Neural Networks

URL http://proceedings.mlr.press/v97/recht19a.html.
Robins, A. Catastrophic forgetting, rehearsal and pseudorehearsal. Connection Science, 7(2):123–146, 1995.
Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M. S., Berg, A. C., and Fei-Fei, L. Imagenet large scale visual recognition c...

doi:10.1007/978-3-030-58598-3_10 1995

[18] [18]

URL http://proceedings.mlr.press/v97/saunshi19a.html.
Schuhmann, C., Beaumont, R., Vencu, R., Gordon, C., Wightman, R., Cherti, M., Coombes, T., Katta, A., Mullis, C., Wortsman, M., Schramowski, P., Kundurthy, S., Crowson, K., Schmidt, L., Kaczmarczyk, R., and Jitsev, J. LAION-5B: an open large-scale dataset for training next generation image-text mode...

2022

[19] [20]

SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features

URL https://openreview.net/forum?id=HkxCzeHFDB.
Tschannen, M., Gritsenko, A. A., Wang, X., Naeem, M. F., Alabdulmohsin, I., Parthasarathy, N., Evans, T., Beyer, L., Xia, Y., Mustafa, B., Hénaff, O. J., Harmsen, J., Steiner, A., and Zhai, X. Siglip 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense fe...

[20] [22]


URL http://proceedings.mlr.press/v119/wang20k.html.
Wang, Z., Codella, N., Chen, Y., Zhou, L., Dai, X., Xiao, B., Yang, J., You, H., Chang, K., Chang, S., and Yuan, L. Multimodal adaptive distillation for leveraging unimodal encoders for vision-language tasks. CoRR, abs/2204.10496, 2022b. doi: 10.48550/ARXIV.2204.10496. URL https://doi.org/10.48550/ar...

doi:10.48550/arxiv 2022

[21] [23]

URL https://proceedings.mlr.press/v202/xue23d.html.
Xue, Y., Joshi, S., Nguyen, D., and Mirzasoleiman, B. Understanding the robustness of multi-modal contrastive learning to distribution shift. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net,

2024

[22] [24]

Reiss, N

URL https://openreview.net/forum?id=rtl4XnJYBh.
Yan, S., Xie, J., and He, X. DER: dynamically expandable representation for class incremental learning. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2021, virtual, June 19-25, 2021, pp. 3014–3023. Computer Vision Foundation / IEEE, 2021. doi: 10.1109/CVPR46437.2021.00303. URL https: //...

doi:10.1109/cvpr46437.2021.00303 2021

[23] [25]

URL https://openreview.net/forum?id=Ee277P3AYC.
Yuan, X., Lin, Z., Kuen, J., Zhang, J., Wang, Y., Maire, M., Kale, A., and Faieta, B. Multimodal contrastive training for visual representation learning. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2021, virtual, June 19-25, 2021, pp. 6995–7004. Computer Vision Foundation / IEEE,

2021

[24] [26]

Reiss, N

doi: 10.1109/CVPR46437.2021.00692. URL https://openaccess.thecvf.com/content/CVPR2021/html/Yuan_Multimodal_Contrastive_Training_for_Visual_Representation_Learning_CVPR_2021_paper.html.
Zenke, F., Poole, B., and Ganguli, S. Continual learning through synaptic intelligence. In Proceedings of the 34th International Conference on Machine Learning, ICML...

doi:10.1109/cvpr46437.2021.00692 2021

[25] [27]

URL http://proceedings.mlr.press/v70/zenke17a.html.
Zhai, X., Mustafa, B., Kolesnikov, A., and Beyer, L. Sigmoid loss for language image pre-training. In IEEE/CVF International Conference on Computer Vision, ICCV 2023, Paris, France, October 1-6, 2023, pp. 11941–11952. IEEE, 2023. doi: 10.1109/ICCV51070.2023.01100. URL https://doi.org/10.1109/ICCV510...

doi:10.1109/iccv51070 2023

[26] [28]

modifies the contrastive loss by using a sigmoid function instead of softmax, and FLIP (Li et al., 2023c) integrates masking strategies to accelerate training.

A.2. Theory of Contrastive Learning

A rich theoretical literature analyzes contrastive learning from first principles, characterizing when and why contrastive objectives recover useful features and...

2019

[27] [29]

Null Space Component: The null space of $Q$ consists of matrices $A$ such that $Q(A) = A C_I = 0$. This holds if and only if the rows of $A$ are in the null space of $C_I$. The orthogonal projector onto this component of the initial matrix $W_I^0$ is $\Pi_{Q^\perp}(W_I^0) = W_I^0 (I - P_I)$, where $P_I = C_I C_I^+$ is the projector onto the row space of $X_I$. This component is preserved

[28] [30]

Range Component: The pseudoinverse $Q^+$ finds the minimum Frobenius norm solution to $Q(W) = P$ that lies in $\mathrm{Range}(Q)$. This is the solution to $W C_I = Y_{FT} X_I^\top$, which is $Q^+(P) = (Y_{FT} X_I^\top) C_I^+$. Combining the components gives the final solution:

$$W_{FT} = W_I^0 (I - P_I) + Y_{FT} X_I^\top (X_I X_I^\top)^+.$$

L2 Regularization. The objective $L(W_I) = \frac{1}{2} \|W_I X_I - Y_{FT}\|_F^2 + \frac{\lambda}{2} \|W_I - W_I^0\|^2$ ...
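The closed form for direct finetuning stated above, the null-space part of the initial weights plus the minimum-norm range solution, can be checked numerically. The sketch below is only an illustration under assumed settings: the dimensions, the random seed, and the names `X`, `Y`, `W0` (standing in for X_I, Y_FT, and the initial W_I) are hypothetical and not taken from the paper. It uses fewer samples than features so that C_I is rank-deficient and the null-space component is nontrivial.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, k = 8, 5, 3                      # assumed sizes: features, samples, outputs (n < d)
X = rng.standard_normal((d, n))        # stands in for X_I
Y = rng.standard_normal((k, n))        # stands in for Y_FT
W0 = rng.standard_normal((k, d))       # stands in for the initial weights W_I^0

C = X @ X.T                            # C_I = X_I X_I^T, rank n < d
C_pinv = np.linalg.pinv(C, rcond=1e-10)  # explicit cutoff so zero eigenvalues are dropped
P = C @ C_pinv                         # P_I = C_I C_I^+
I = np.eye(d)

# Closed form: preserved null-space part + minimum-norm range part.
W_ft = W0 @ (I - P) + Y @ X.T @ C_pinv

# The range part solves the normal equation W C_I = Y_FT X_I^T.
normal_eq_ok = np.allclose(W_ft @ C, Y @ X.T)
# The component of W0 outside Range(Q) is left untouched.
null_part_ok = np.allclose(W_ft @ (I - P), W0 @ (I - P))
print(normal_eq_ok, null_part_ok)
```

Both checks follow directly from the identities C_I^+ C_I = P_I and P_I X_I = X_I; the explicit `rcond` is a numerical safeguard chosen for this sketch, not part of the derivation.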

[29] [31]

best of both worlds

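Range Component: The pseudoinverse is $Q_{SD}^+ = \frac{1}{1+\lambda} Q^+$, where $Q^+$ corresponds to the direct finetuning case. Applying it to $P_{SD}$:

$$
\begin{aligned}
Q_{SD}^+(P_{SD}) &= \frac{1}{1+\lambda}\, Q^+\!\left(Y_{FT} X_I^\top + \lambda W_I^0 C_I\right) \\
&= \frac{1}{1+\lambda} \left[(Y_{FT} X_I^\top) C_I^+ + \lambda\, Q^+(Q(W_I^0))\right] \\
&= \frac{1}{1+\lambda} \left[(Y_{FT} X_I^\top) C_I^+ + \lambda\, \Pi_Q(W_I^0)\right] \\
&= \frac{1}{1+\lambda} \left[Y_{FT} X_I^\top (X_I X_I^\top)^+ + \lambda W_I^0 P_I\right].
\end{aligned}
$$

Com...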
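The self-distillation range component above can be sanity-checked in the same way. The sketch below rests on two labeled assumptions. First, the linear system is read off from the derivation as (1 + λ) W C_I = Y_FT X_I^T + λ W_I^0 C_I, which is what Q_SD^+ = Q^+ / (1 + λ) and the stated P_SD imply. Second, because the combined solution is cut off in this excerpt, the code assumes it adds the preserved null-space part W_I^0 (I − P_I) exactly as in the direct finetuning case. The function name `self_distill_solution`, the sizes, and the seed are hypothetical.

```python
import numpy as np

def self_distill_solution(X, Y, W0, lam):
    """Assumed combined solution: null-space part of W0 plus Q_SD^+(P_SD)."""
    C = X @ X.T                                # C_I = X_I X_I^T
    C_pinv = np.linalg.pinv(C, rcond=1e-10)    # explicit cutoff for zero eigenvalues
    P = C @ C_pinv                             # P_I = C_I C_I^+
    I = np.eye(X.shape[0])
    range_part = (Y @ X.T @ C_pinv + lam * W0 @ P) / (1.0 + lam)
    return W0 @ (I - P) + range_part

rng = np.random.default_rng(0)
d, n, k = 8, 5, 3                              # assumed sizes with n < d
X = rng.standard_normal((d, n))
Y = rng.standard_normal((k, n))
W0 = rng.standard_normal((k, d))
C = X @ X.T

lam = 0.5
W_sd = self_distill_solution(X, Y, W0, lam)

# Stationarity of the assumed system Q_SD(W) = P_SD.
stationarity_ok = np.allclose((1 + lam) * W_sd @ C, Y @ X.T + lam * W0 @ C)
# lam = 0 recovers direct finetuning, which interpolates the data here.
fits_data = np.allclose(self_distill_solution(X, Y, W0, 0.0) @ X, Y)
# A very large lam pulls the solution back to the initial weights.
near_init = np.allclose(self_distill_solution(X, Y, W0, 1e8), W0, atol=1e-5)
print(stationarity_ok, fits_data, near_init)
```

The two limiting cases show how λ interpolates, inside the task subspace only, between the finetuned solution and the initial weights, while the component outside that subspace is unchanged for every λ.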

2018