TRACER: Persistent Regularization for Robust Multimodal Finetuning
Pith reviewed 2026-06-29 08:48 UTC · model grok-4.3
The pith
A weighted moving average teacher prevents collapse and enables persistent regularization in multimodal contrastive finetuning.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
TRACER combines contrastive learning with WMA-guided multi-perspective distillation to achieve consistent OOD accuracy and calibration gains. The theoretical framework shows that self-distillation retains pretrained knowledge more effectively than other regularization approaches and that standard EMA teachers collapse, whereas a WMA teacher maintains a persistent regularizing force over finite horizons and converges without bias in the task subspace while preserving orthogonal knowledge. Experiments on CLIP finetuning across three backbone architectures confirm the gains and the robustness to hyperparameter choices.
What carries the argument
The Weighted Moving Average (WMA) teacher, which maintains a persistent regularizing force over finite horizons, yields bias-free convergence in the task subspace, and preserves orthogonal knowledge.
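To make the contrast with an EMA teacher concrete, here is a minimal sketch of how much weight each kind of teacher still places on the pretrained model after a number of steps. It assumes the common EMA update `teacher_t = beta * teacher_{t-1} + (1 - beta) * student_t` and, as a hypothetical stand-in for WMA, a uniform running mean over the student trajectory. The paper's actual WMA weights and its formal notion of collapse may differ; the function names are illustrative only.

```python
import numpy as np

def ema_anchor_weight(beta: float, steps: int) -> np.ndarray:
    """Weight an EMA teacher keeps on the pretrained weights at steps 0..steps.

    Assumed update: teacher_t = beta * teacher_{t-1} + (1 - beta) * student_t,
    with teacher_0 equal to the pretrained weights. Unrolling the recursion
    leaves a coefficient of beta**t on the pretrained weights.
    """
    t = np.arange(steps + 1)
    return beta ** t

def running_mean_anchor_weight(steps: int) -> np.ndarray:
    """Weight on the pretrained weights under a uniform running mean.

    Hypothetical WMA choice: teacher_t = mean(student_0, ..., student_t),
    where student_0 is the pretrained model, so the coefficient is 1 / (t + 1).
    """
    t = np.arange(steps + 1)
    return 1.0 / (t + 1)

steps = 1000
ema = ema_anchor_weight(beta=0.99, steps=steps)
wma = running_mean_anchor_weight(steps=steps)
print(f"EMA weight on pretrained model after {steps} steps: {ema[-1]:.2e}")
print(f"Running-mean weight after {steps} steps:            {wma[-1]:.2e}")
```

Under these assumptions the EMA coefficient decays geometrically while the running-mean coefficient decays only as 1/(t + 1), which is one simple way to picture a teacher that keeps pulling toward the pretrained model over a finite horizon.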
If this is right
- Self-distillation outperforms other regularization approaches in retaining pretrained model knowledge.
- TRACER yields consistent OOD accuracy and calibration improvements across multiple backbone architectures.
- The method is robust to hyperparameter choices.
- The framework provides a closed-form solution and a geometric decomposition for each regularization strategy.
Where Pith is reading between the lines
- Similar persistent regularization techniques could apply to other types of model adaptation beyond multimodal contrastive finetuning.
- The analysis of teacher collapse might inform designs in knowledge distillation for other domains like natural language processing.
- Testing the framework on additional datasets or tasks could reveal further benefits or limitations.
Load-bearing premise
The geometric decomposition and closed-form solutions accurately capture the dynamics of regularization strategies including EMA collapse and WMA persistent force.
What would settle it
The theoretical claims would be falsified by experiments in which an EMA teacher does not collapse, or in which a WMA teacher fails to provide persistent regularization over finite horizons.
Original abstract
Mainstream strategies for finetuning pretrained multimodal models often degrade out-of-distribution (OOD) robustness, a phenomenon known as catastrophic forgetting. In this paper, we develop a theoretical framework for multimodal contrastive finetuning, yielding closed-form solutions and a geometric decomposition for each strategy. This framework shows that self-distillation is more effective than other regularization approaches to retain the knowledge of the pretrained model. Our analysis reveals a largely overlooked limitation: standard Exponential Moving Average (EMA) teachers, widely used in robust finetuning, suffer from collapse. To solve this, we prove that a Weighted Moving Average (WMA) teacher maintains a persistent regularizing force over finite horizons and yields bias-free convergence in the task subspace while preserving orthogonal knowledge. These insights motivate **TRACER** (**T**rajectory-**R**obust **A**nchoring for **C**ontrastive **E**ncoder **R**egularization), which combines contrastive learning with WMA-guided multi-perspective distillation. Extensive experiments on CLIP finetuning demonstrate consistent OOD accuracy and calibration gains across three backbone architectures, and comprehensive ablations confirm that TRACER is both principled and robust to hyperparameter choices. Code is available at [https://github.com/HesamAsad/TRACER](https://github.com/HesamAsad/TRACER).
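As a rough illustration of what "contrastive learning plus teacher-guided distillation" can look like, the sketch below pairs the standard symmetric InfoNCE loss used for CLIP-style training with a generic KL distillation term between teacher and student similarity distributions. The distillation term is a single-perspective stand-in chosen for illustration. It is not TRACER's multi-perspective objective, and the function names, temperature, and loss weight are all assumptions.

```python
import numpy as np

def log_softmax(logits: np.ndarray, axis: int) -> np.ndarray:
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale each row to unit Euclidean norm."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def clip_contrastive_loss(img: np.ndarray, txt: np.ndarray, temperature: float = 0.07) -> float:
    """Symmetric InfoNCE over a batch of matched image/text embeddings."""
    logits = l2_normalize(img) @ l2_normalize(txt).T / temperature
    idx = np.arange(len(img))
    loss_i2t = -log_softmax(logits, axis=1)[idx, idx].mean()  # image -> text
    loss_t2i = -log_softmax(logits, axis=0)[idx, idx].mean()  # text -> image
    return float(0.5 * (loss_i2t + loss_t2i))

def similarity_distillation(s_img, s_txt, t_img, t_txt, temperature: float = 0.07) -> float:
    """KL(teacher || student) between image-to-text similarity distributions.

    Hypothetical distillation term; the paper's multi-perspective loss is not
    specified in this summary.
    """
    s_logp = log_softmax(l2_normalize(s_img) @ l2_normalize(s_txt).T / temperature, axis=1)
    t_logp = log_softmax(l2_normalize(t_img) @ l2_normalize(t_txt).T / temperature, axis=1)
    return float((np.exp(t_logp) * (t_logp - s_logp)).sum(axis=1).mean())

rng = np.random.default_rng(0)
img, txt = rng.normal(size=(8, 16)), rng.normal(size=(8, 16))          # student embeddings
t_img, t_txt = rng.normal(size=(8, 16)), rng.normal(size=(8, 16))      # teacher embeddings
distill_weight = 1.0  # assumed trade-off coefficient
total = clip_contrastive_loss(img, txt) + distill_weight * similarity_distillation(img, txt, t_img, t_txt)
print(f"total loss: {total:.4f}")
```

In an actual finetuning loop the teacher embeddings would come from the averaged (EMA or WMA) copy of the model rather than from random draws.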
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper develops a theoretical framework for multimodal contrastive finetuning of pretrained models (e.g., CLIP), providing closed-form solutions and a geometric decomposition of regularization strategies. It identifies collapse in standard EMA teachers and proves that a Weighted Moving Average (WMA) teacher maintains persistent orthogonal regularization over finite horizons with bias-free convergence in the task subspace. These insights motivate TRACER, which augments contrastive learning with WMA-guided multi-perspective self-distillation. Experiments across three backbone architectures report consistent gains in OOD accuracy and calibration, with ablations supporting robustness to hyperparameters; code is released.
Significance. If the closed-form derivations and geometric claims hold under realistic finetuning conditions, the work supplies a principled explanation for why self-distillation outperforms other regularizers and directly motivates a practical replacement for EMA. This could influence robust finetuning practices for vision-language models where OOD reliability matters. Reproducibility via the linked GitHub repository is a clear strength.
major comments (2)
- [theoretical framework] Theoretical framework (closed-form solutions and geometric decomposition): the derivations establishing EMA collapse and WMA's persistent orthogonal force appear to rely on assumptions such as linearity of the encoder dynamics, infinite data, or exact task-subspace orthogonality. These may not hold for CLIP's non-linear encoders, batch noise, or finite-horizon training; if violated, the claimed superiority of WMA over EMA does not necessarily follow even if empirical gains are observed.
- [theoretical framework and § on TRACER] Motivation for TRACER (WMA-guided multi-perspective distillation): the central claim that this combination yields consistent OOD gains rests on the geometric decomposition accurately capturing each regularization strategy. Without explicit verification that the derived bias-free convergence and persistent force survive the non-linear, stochastic setting of actual CLIP finetuning, the link between theory and the reported experimental improvements remains unestablished.
minor comments (1)
- [abstract] The abstract and introduction would benefit from a brief statement of the precise assumptions under which the closed-form solutions are derived (e.g., linearity, data regime).
Simulated Author's Rebuttal
We thank the referee for the thoughtful comments on the theoretical framework and its connection to TRACER. We address each point below, clarifying the role of the assumptions and the intended link to experiments.
Point-by-point responses
- Referee: Theoretical framework (closed-form solutions and geometric decomposition): the derivations establishing EMA collapse and WMA's persistent orthogonal force appear to rely on assumptions such as linearity of the encoder dynamics, infinite data, or exact task-subspace orthogonality. These may not hold for CLIP's non-linear encoders, batch noise, or finite-horizon training; if violated, the claimed superiority of WMA over EMA does not necessarily follow even if empirical gains are observed.
  Authors: The closed-form solutions and geometric decomposition are derived under linear dynamics and infinite-data assumptions to obtain exact characterizations of regularization behavior; this is a standard approach for analytical insight. We agree these assumptions are idealized relative to non-linear CLIP encoders and stochastic batch training. The framework is presented as providing intuition and motivation rather than a complete non-linear proof. The empirical results on real CLIP models are offered as supporting evidence that the predicted advantages of WMA materialize in practice. We will add an explicit limitations paragraph discussing the scope of the assumptions.
  Revision: partial
- Referee: Motivation for TRACER (WMA-guided multi-perspective distillation): the central claim that this combination yields consistent OOD gains rests on the geometric decomposition accurately capturing each regularization strategy. Without explicit verification that the derived bias-free convergence and persistent force survive the non-linear, stochastic setting of actual CLIP finetuning, the link between theory and the reported experimental improvements remains unestablished.
  Authors: The theory supplies a principled motivation for replacing EMA with WMA and for the multi-perspective distillation design; the experiments (including cross-architecture ablations) then test whether these design choices deliver the expected OOD and calibration benefits. We do not claim a rigorous non-linear stochastic proof of the derived quantities. The consistent gains and hyperparameter robustness are presented as empirical corroboration of the framework's practical utility. We will revise the motivation section to state more clearly that the geometric analysis provides intuition while the experiments serve as the primary validation.
  Revision: partial
Circularity Check
No circularity detected; theoretical claims presented as independent derivations
Full rationale
The provided abstract and context describe a theoretical framework yielding closed-form solutions and geometric decompositions for regularization strategies in multimodal contrastive finetuning, including analysis of EMA collapse and WMA persistence. No equations, self-citations, fitted parameters renamed as predictions, or self-referential definitions are exhibited in the text. The derivation chain is presented as first-principles analysis without reduction to inputs by construction, and the central empirical claims rest on experiments rather than tautological predictions. This is the expected honest non-finding when no load-bearing circular steps can be quoted.
Reference graph
Works this paper leans on
-
[1]
IEEE, 2021. doi: 10.1109/ICCV48922.2021.00951. URL https://doi.org/10.1109/ICCV48922.2021.00951. Deng, J., Dong, W., Socher, R., Li, L., Li, K., and Fei-Fei, L. Imagenet: A large-scale hierarchical image database. In 2009 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2009), 20-25 June 2009, Miami, Florida, USA, pp. 248...
-
[2]
Fang, A., Ilharco, G., Wortsman, M., Wan, Y., Shankar, V., Dave, A., and Schmidt, L
URL https://openreview.net/forum?id=YicbFdNTTy. Fang, A., Ilharco, G., Wortsman, M., Wan, Y., Shankar, V., Dave, A., and Schmidt, L. Data determines distributional robustness in contrastive language image pre-training (CLIP). In International Conference on Machine Learning, ICML 2022, 17-23 July 2022, Baltimore, Maryland, USA, volume 162 of Procee...
2022
-
[3]
URL https://proceedings.mlr.press/v162/fang22a.html. Fang, A., Jose, A. M., Jain, A., Schmidt, L., Toshev, A. T., and Shankar, V. Data filtering networks. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, ...
-
[4]
Garrido, Q., Chen, Y., Bardes, A., Najman, L., and LeCun, Y
URL http://proceedings.mlr.press/v80/furlanello18a.html. Garrido, Q., Chen, Y., Bardes, A., Najman, L., and LeCun, Y. On the duality between contrastive and non-contrastive self-supervised learning. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net, 2023. URL https://openr...
-
[5]
URL https://openreview.net/forum?id=AuEgNlEAmed. HaoChen, J. Z., Wei, C., Kumar, A., and Ma, T. Beyond separability: Analyzing the linear transferability of contrastive representations to related subpopulations. In Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New O...
-
[6]
URL http://proceedings.mlr.press/v97/houlsby19a.html. Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., and Chen, W. Lora: Low-rank adaptation of large language models. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. OpenReview.net,
2022
-
[7]
Ilharco, G., Wortsman, M., Wightman, R., Gordon, C., Carlini, N., Taori, R., Dave, A., Shankar, V., Namkoong, H., Miller, J., Hajishirzi, H., Farhadi, A., and Schmidt, L
URL https://openreview.net/forum?id=nZeVKeeFYf9. Ilharco, G., Wortsman, M., Wightman, R., Gordon, C., Carlini, N., Taori, R., Dave, A., Shankar, V., Namkoong, H., Miller, J., Hajishirzi, H., Farhadi, A., and Schmidt, L. Openclip, July 2021. URL https://doi.org/10.5281/zenodo.5143773. If you use this software, please cite it as below. Izmailov, P., Po...
2021
-
[8]
cc/paper_files/paper/2018/file/5a4be1fa34e62bb8a6ec6b91d2462f5a-Paper
URL https://proceedings.neurips.cc/paper_files/paper/2018/file/5a4be1fa34e62bb8a6ec6b91d2462f5a-Paper.pdf. Jang, D., Yun, S., and Han, D. Model stock: All we need is just a few fine-tuned models. In Computer Vision - ECCV 2024 - 18th European Conference, Milan, Italy, September 29-October 4, 2024, Proceedings, Part XLIV, volume 15102 of Lecture Not...
-
[9]
Jia, M., Tang, L., Chen, B., Cardie, C., Belongie, S
URL http://proceedings.mlr.press/v139/jia21b.html. Jia, M., Tang, L., Chen, B., Cardie, C., Belongie, S. J., Hariharan, B., and Lim, S. Visual prompt tuning. In Avidan, S., Brostow, G. J., Cissé, M., Farinella, G. M., and Hassner, T. (eds.), Computer Vision - ECCV 2022 - 17th European Conference, Tel Aviv, Israel, October 23-27, 2022, Proceedings, Part XX...
-
[10]
Kumar, A., Raghunathan, A., Jones, R
URL http://proceedings.mlr.press/v97/kornblith19a.html. Kumar, A., Raghunathan, A., Jones, R. M., Ma, T., and Liang, P. Fine-tuning can distort pretrained features and underperform out-of-distribution. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. OpenReview.net,
2022
-
[11]
URL https://openreview.net/forum?id=UYneFzXSJWh. Laine, S. and Aila, T. Temporal ensembling for semi-supervised learning. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. OpenReview.net, 2017. URL https://openreview.net/forum?id=BJ6oOfqge. LeCun, Y., Bottou, L., B...
-
[12]
In: IEEE/CVF International Conference on Computer Vision
PMLR, 2018. URL http://proceedings.mlr.press/v80/li18a.html. Li, X., Fang, Y., Liu, M., Ling, Z., Tu, Z., and Su, H. Distilling large vision-language model with out-of-distribution generalizability. In IEEE/CVF International Conference on Computer Vision, ICCV 2023, Paris, France, October 1-6, 2023, pp. 2492–2503. IEEE, 2023a. doi: 10.1109/ICCV51070....
-
[13]
NeMo guardrails: A toolkit for controllable and safe LLM applications with pro- grammable rails
URL https://doi.org/10.18653/v1/2021.acl-long.353. Li, Y., Fan, H., Hu, R., Feichtenhofer, C., and He, K. Scaling language-image pre-training via masking. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2023, Vancouver, BC, Canada, June 17-24, 2023, pp. 23390–23400. IEEE, 2023c. doi: 10.1109/CVPR52729.2023.02240. URL https:// d...
-
[14]
doi: 10.1109/TPAMI.2017.2773081. URL https://doi.org/10.1109/TPAMI.2017.2773081. Li, Z., Li, X., Fu, X., Zhang, X., Wang, W., Chen, S., and Yang, J. Promptkd: Unsupervised prompt distillation for vision-language models. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024, Seattle, WA, USA, June 16-22, 2024, pp. 26607–26616. IEEE, ...
-
[15]
URL https://proceedings
PMLR, 2023. URL https://proceedings.mlr.press/v206/nakada23a.html. Nam, G., Heo, B., and Lee, J. Lipsum-ft: Robust fine-tuning of zero-shot models using random text guidance. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024. URL https://openreview.net/forum?id=2JF8mJRJ7...
2023
-
[16]
URL https://openreview.net/forum?id=M0MF4t3hE9. Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., and Sutskever, I. Learning transferable visual models from natural language supervision. In Proceedings of the 38th International Conference on Machine Learning, ICML 202...
-
[17]
URL http://proceedings.mlr.press/v97/recht19a.html. Robins, A. Catastrophic forgetting, rehearsal and pseudorehearsal. Connection Science, 7(2):123–146, 1995. Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M. S., Berg, A. C., and Fei-Fei, L. Imagenet large scale visual recognition c...
-
[18]
URL http://proceedings.mlr.press/v97/saunshi19a.html. Schuhmann, C., Beaumont, R., Vencu, R., Gordon, C., Wightman, R., Cherti, M., Coombes, T., Katta, A., Mullis, C., Wortsman, M., Schramowski, P., Kundurthy, S., Crowson, K., Schmidt, L., Kaczmarczyk, R., and Jitsev, J. LAION-5B: an open large-scale dataset for training next generation image-text mode...
2022
-
[20]
URL https://openreview.net/forum?id=HkxCzeHFDB. Tschannen, M., Gritsenko, A. A., Wang, X., Naeem, M. F., Alabdulmohsin, I., Parthasarathy, N., Evans, T., Beyer, L., Xia, Y., Mustafa, B., Hénaff, O. J., Harmsen, J., Steiner, A., and Zhai, X. Siglip 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense fe...
-
[22]
2026 Sulfur fractionation in coronal plumes as observed by Solar Orbiter/SPICE
URL http://proceedings.mlr.press/v119/wang20k.html. Wang, Z., Codella, N., Chen, Y., Zhou, L., Dai, X., Xiao, B., Yang, J., You, H., Chang, K., Chang, S., and Yuan, L. Multimodal adaptive distillation for leveraging unimodal encoders for vision-language tasks. CoRR, abs/2204.10496, 2022b. doi: 10.48550/ARXIV.2204.10496. URL https://doi.org/10.48550/ar...
-
[23]
Xue, Y., Joshi, S., Nguyen, D., and Mirzasoleiman, B
URL https://proceedings.mlr.press/v202/xue23d.html. Xue, Y., Joshi, S., Nguyen, D., and Mirzasoleiman, B. Understanding the robustness of multi-modal contrastive learning to distribution shift. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net,
2024
-
[24]
URL https://openreview.net/forum?id=rtl4XnJYBh. Yan, S., Xie, J., and He, X. DER: dynamically expandable representation for class incremental learning. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2021, virtual, June 19-25, 2021, pp. 3014–3023. Computer Vision Foundation / IEEE, 2021. doi: 10.1109/CVPR46437.2021.00303. URL https://...
-
[25]
Yuan, X., Lin, Z., Kuen, J., Zhang, J., Wang, Y., Maire, M., Kale, A., and Faieta, B
URL https://openreview.net/forum?id=Ee277P3AYC. Yuan, X., Lin, Z., Kuen, J., Zhang, J., Wang, Y., Maire, M., Kale, A., and Faieta, B. Multimodal contrastive training for visual representation learning. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2021, virtual, June 19-25, 2021, pp. 6995–7004. Computer Vision Foundation / IEEE,
2021
-
[26]
doi: 10.1109/CVPR46437.2021.00692. URL https://openaccess.thecvf.com/content/CVPR2021/html/Yuan_Multimodal_Contrastive_Training_for_Visual_Representation_Learning_CVPR_2021_paper.html. Zenke, F., Poole, B., and Ganguli, S. Continual learning through synaptic intelligence. In Proceedings of the 34th International Conference on Machine Learning, ICML...
-
[27]
Zhai, X., Mustafa, B., Kolesnikov, A., and Beyer, L
URL http://proceedings.mlr.press/v70/zenke17a.html. Zhai, X., Mustafa, B., Kolesnikov, A., and Beyer, L. Sigmoid loss for language image pre-training. In IEEE/CVF International Conference on Computer Vision, ICCV 2023, Paris, France, October 1-6, 2023, pp. 11941–11952. IEEE, 2023. doi: 10.1109/ICCV51070.2023.01100. URL https://doi.org/10.1109/ICCV510...
-
[28]
modifies the contrastive loss by using a sigmoid function instead of softmax, and FLIP (Li et al., 2023c) integrates masking strategies to accelerate training. A.2. Theory of Contrastive Learning A rich theoretical literature analyzes contrastive learning from first principles, characterizing when and why contrastive objectives recover useful features and...
-
[29]
This holds if and only if the rows of $A$ are in the null space of $C_I$
Null Space Component: The null space of $Q$ consists of matrices $A$ such that $Q(A) = A C_I = 0$. This holds if and only if the rows of $A$ are in the null space of $C_I$. The orthogonal projection of the initial matrix $W_I^0$ onto this component is $\Pi_{Q^\perp}(W_I^0) = W_I^0 (I - P_I)$, where $P_I = C_I C_I^+$ is the projector onto the row space of $X_I$. This component is preserved
-
[30]
This is the solution to $W C_I = Y_{FT} X_I^\top$, which is $Q^+(P) = (Y_{FT} X_I^\top) C_I^+$
Range Component: The pseudoinverse $Q^+$ finds the minimum Frobenius norm solution to $Q(W) = P$ that lies in $\mathrm{Range}(Q)$. This is the solution to $W C_I = Y_{FT} X_I^\top$, which is $Q^+(P) = (Y_{FT} X_I^\top) C_I^+$. Combining the components gives the final solution: $W_{FT} = W_I^0 (I - P_I) + Y_{FT} X_I^\top (X_I X_I^\top)^+$.
L2 Regularization. The objective $L(W_I) = \frac{1}{2} \|W_I X_I - Y_{FT}\|_F^2 + \frac{\lambda}{2} \|W_I - W_I^0\|^2$ ...
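The direct-finetuning closed form quoted above can be checked numerically. The following is a minimal sketch under assumed shapes: $X_I$ is stored as a `d x n` matrix with one column per sample, $C_I = X_I X_I^\top$, and `n < d` so that a non-trivial null space exists. The variable names and the random data are illustrative and not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, k = 12, 5, 3             # feature dim, finetuning samples, output dim (n < d)
X = rng.normal(size=(d, n))    # finetuning inputs, one column per sample (assumed layout)
Y = rng.normal(size=(k, n))    # finetuning targets
W0 = rng.normal(size=(k, d))   # pretrained weights

C = X @ X.T                               # C_I = X_I X_I^T, rank n
C_pinv = np.linalg.pinv(C, rcond=1e-10)   # explicit cutoff keeps the rank-deficient pinv stable
P = C @ C_pinv                            # P_I = C_I C_I^+, orthogonal projector onto range(C_I)
I = np.eye(d)

# W_FT = W0 (I - P_I) + Y X^T (X X^T)^+
W_ft = W0 @ (I - P) + Y @ X.T @ C_pinv

fit_residual = np.linalg.norm(W_ft @ X - Y)            # task component: data are fit
null_drift = np.linalg.norm((W_ft - W0) @ (I - P))     # null-space component: unchanged
print(f"fit residual: {fit_residual:.2e}, null-space drift: {null_drift:.2e}")
```

Both quantities should be at the level of floating-point error in this setup, matching the statement that the range component is set by the finetuning data while the null-space component of the pretrained weights is preserved.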
-
[31]
best of both worlds
Range Component: The pseudoinverse is $Q_{SD}^+ = \frac{1}{1+\lambda} Q^+$, where $Q^+$ corresponds to the direct finetuning case. Applying it to $P_{SD}$:
$$Q_{SD}^+(P_{SD}) = \frac{1}{1+\lambda} Q^+\left(Y_{FT} X_I^\top + \lambda W_I^0 C_I\right) = \frac{1}{1+\lambda}\left[(Y_{FT} X_I^\top) C_I^+ + \lambda Q^+(Q(W_I^0))\right] = \frac{1}{1+\lambda}\left[(Y_{FT} X_I^\top) C_I^+ + \lambda \Pi_Q(W_I^0)\right] = \frac{1}{1+\lambda}\left[Y_{FT} X_I^\top (X_I X_I^\top)^+ + \lambda W_I^0 P_I\right].$$
Com...
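The self-distillation range component can be sanity-checked in the same way. The sketch below rests on two assumptions that are inferred from this excerpt rather than stated in it: the objective is $\frac{1}{2}\|W X - Y\|_F^2 + \frac{\lambda}{2}\|W X - W^0 X\|_F^2$ (which is consistent with $P_{SD} = Y_{FT} X_I^\top + \lambda W_I^0 C_I$ and $Q_{SD}^+ = \frac{1}{1+\lambda} Q^+$), and the full solution adds the preserved null-space term $W^0 (I - P_I)$ as in the direct finetuning case. Function names are hypothetical.

```python
import numpy as np

def self_distilled_solution(W0, X, Y, lam):
    """Assumed closed form: W0 (I - P) + (Y X^T C^+ + lam * W0 P) / (1 + lam)."""
    d = X.shape[0]
    C = X @ X.T
    C_pinv = np.linalg.pinv(C, rcond=1e-10)
    P = C @ C_pinv
    range_part = (Y @ X.T @ C_pinv + lam * W0 @ P) / (1.0 + lam)
    return W0 @ (np.eye(d) - P) + range_part

def objective_gradient(W, W0, X, Y, lam):
    """Gradient of 0.5*||W X - Y||_F^2 + 0.5*lam*||W X - W0 X||_F^2 (assumed objective)."""
    return (W @ X - Y) @ X.T + lam * (W - W0) @ X @ X.T

rng = np.random.default_rng(0)
d, n, k = 12, 5, 3
X = rng.normal(size=(d, n))
Y = rng.normal(size=(k, n))
W0 = rng.normal(size=(k, d))

lam = 2.0
W_sd = self_distilled_solution(W0, X, Y, lam)
grad_norm = np.linalg.norm(objective_gradient(W_sd, W0, X, Y, lam))

# A very large lam should pull the solution back to the pretrained weights.
drift_large_lam = np.linalg.norm(self_distilled_solution(W0, X, Y, 1e9) - W0)
print(f"gradient norm at W_SD: {grad_norm:.2e}, drift at large lambda: {drift_large_lam:.2e}")
```

A vanishing gradient confirms stationarity under the assumed objective, and the large-$\lambda$ check shows the expected interpolation between the finetuned solution and the pretrained weights inside the task subspace.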
2018