pith. machine review for the scientific record.

arxiv: 2604.01833 · v2 · submitted 2026-04-02 · 💻 cs.CV · cs.CL · cs.LG

Recognition: 2 theorem links

· Lean Theorem

Language-Pretraining-Induced Bias: A Strong Foundation for General Vision Tasks

Authors on Pith · no claims yet

Pith reviewed 2026-05-13 21:03 UTC · model grok-4.3

classification 💻 cs.CV cs.CL cs.LG
keywords language pretraining · vision tasks · bridge training · modality adaptation · random labels · partial training · LLM parameters · cross-modality transfer

The pith

A random-label bridge training stage aligns large language model parameters with vision tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Language pre-training models differ from vision models in the ratio of outlier parameters, leading to the assumption that they cannot transfer well to visual tasks. This paper shows that a bridge training stage using random labels can adapt LLM parameters to vision foundation tasks without any manual labeling. Partial bridge training of only certain layers is often advantageous because those layers hold strong foundational properties useful for vision. Together, these results suggest that language pre-trained parameters can be leveraged directly in vision models through this adaptation stage.

Core claim

Adding a bridge training stage as a modality adaptation learner can effectively align Large Language Model (LLM) parameters with vision tasks. Specifically, random label bridge training requires no manual labeling and helps LLM parameters adapt to vision foundation tasks. Partial bridge training is often advantageous because certain layers in LLMs exhibit strong foundational properties that remain beneficial even without fine-tuning for visual tasks.
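To make the two-stage recipe concrete, here is a minimal sketch of what random-label bridge training followed by real-label refinement could look like. It assumes a HuggingFace-style GPT-2 backbone exposing `inputs_embeds` and a block list `.h`; the patch embedding, the layer split, and all hyperparameters are illustrative stand-ins, not the authors' implementation.

```python
import torch
import torch.nn as nn


class VisionOnLLM(nn.Module):
    """Wrap a language-pretrained transformer so it consumes image patches."""

    def __init__(self, llm_backbone, hidden_dim, num_classes, patch_dim=16 * 16 * 3):
        super().__init__()
        self.patch_embed = nn.Linear(patch_dim, hidden_dim)  # project flattened patches into the LLM token space
        self.backbone = llm_backbone                         # e.g. a HuggingFace GPT2Model (assumption)
        self.head = nn.Linear(hidden_dim, num_classes)

    def forward(self, patches):                              # patches: (batch, n_patches, patch_dim)
        tokens = self.patch_embed(patches)
        feats = self.backbone(inputs_embeds=tokens).last_hidden_state
        return self.head(feats.mean(dim=1))                  # mean-pool tokens, then classify


def bridge_stage(model, loader, num_classes, epochs=1, lr=1e-4, trainable_blocks=None):
    """Stage 1: bridge training on random labels (no manual annotation used)."""
    if trainable_blocks is not None:                         # partial bridge training: update only some blocks
        for p in model.backbone.parameters():
            p.requires_grad = False
        for idx in trainable_blocks:
            for p in model.backbone.h[idx].parameters():     # `.h` is the GPT-2-style block list (assumption)
                p.requires_grad = True
    opt = torch.optim.AdamW([p for p in model.parameters() if p.requires_grad], lr=lr)
    for _ in range(epochs):
        for patches, _ in loader:                            # real labels are ignored in this stage
            rand_y = torch.randint(0, num_classes, (patches.size(0),))
            loss = nn.functional.cross_entropy(model(patches), rand_y)
            opt.zero_grad()
            loss.backward()
            opt.step()


def downstream_stage(model, loader, epochs=1, lr=1e-4):
    """Stage 2: refine on the real labels of the target vision task."""
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    for _ in range(epochs):
        for patches, y in loader:
            loss = nn.functional.cross_entropy(model(patches), y)
            opt.zero_grad()
            loss.backward()
            opt.step()
```

In this sketch the random labels carry no semantic information; the bridge stage only forces the language-pretrained blocks to process image-derived token embeddings before real labels are ever seen.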

What carries the argument

The bridge training stage using random labels as a modality adaptation learner to align parameter spaces between language and vision models.

Load-bearing premise

The assumption that random-label bridge training can successfully align the disparate language and vision parameter spaces.

What would settle it

Observing no performance improvement on vision tasks when using random label bridge training compared to no adaptation or full fine-tuning would falsify the effectiveness of the alignment.
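A sketch of that comparison, reusing the Stage 1 / Stage 2 helpers sketched above; `make_model`, `evaluate`, the loaders, and the epoch counts are hypothetical placeholders for whatever task and metric the test is run on.

```python
def settle_it(make_model, bridge_loader, task_loader, test_loader, num_classes):
    """Compare the same vision task under three adaptation regimes."""
    results = {}
    for condition in ("no_adaptation", "random_label_bridge", "full_finetune"):
        model = make_model()                                  # fresh language-pretrained initialization
        if condition == "random_label_bridge":
            bridge_stage(model, bridge_loader, num_classes)   # Stage 1, labels drawn at random
        if condition == "full_finetune":
            downstream_stage(model, task_loader, epochs=10)   # train everything on real labels
        else:
            downstream_stage(model, task_loader, epochs=1)    # light Stage 2 refinement only
        results[condition] = evaluate(model, test_loader)     # hypothetical accuracy/mAP helper
    return results  # no gain for the bridge condition over either baseline would falsify the claim
```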

Figures

Figures reproduced from arXiv: 2604.01833 by Yaxin Luo, Zhiqiang Shen.

Figure 1
Figure 1. In cross-domain adaptation, the data type … view at source ↗
Figure 2
Figure 2. Outlier parameters and weight distributions in models trained on different modalities. Language-pretrained GPT-2 shows a markedly heavier-tailed distribution with numerous large-magnitude “Outlier” weights, whereas vision-pretrained ViT and a GPT-2 structure model trained on images exhibit fewer outliers and narrower spreads. view at source ↗
Figure 3
Figure 3. Train and test acc. curves for pretrained vs. scratch GPT-2 on CIFAR-10 using varying random … view at source ↗
Figure 4
Figure 4. Loss landscape on CIFAR-10. We visualize a 2D cross-section of the high-dimensional loss surface by plotting L(θ0 + αd1 + βd2), where d1 = θT − θ0 is the training direction and d2 is a random direction orthogonal to d1 (both per-layer normalized). The X-axis and Y-axis correspond to the coefficients α and β, respectively (height/color indicates loss). Top: Correct Labels Training. Bottom: 100% Random Labels … view at source ↗
Figure 6
Figure 6. Overview of our two-stage Language Bias Bridge Learning framework. In Stage 1, the pretrained LLM is adapted to the target modality under random labels. In Stage 2, a lightweight classifier refines these representations on real labels. Concretely, we extract the final-layer hidden states h_i ∈ R^d for each sample i, where d denotes the hidden dimension. We then apply a t-SNE mapping f_t-SNE : R^d → R^2 and p… view at source ↗
Figure 5
Figure 5. t-SNE embeddings for pretrained (top) and … view at source ↗
Figure 8
Figure 8. Partial Bridge Training results under both random-label and correct-label settings. Updating only the first 2 layers already matches the performance of training all layers, and training the first 5 layers surpasses it. view at source ↗
Figure 9
Figure 9. Train and test loss curves for DETR. Language-pretrained initialization accelerates convergence and lowers final loss, highlighting its benefit for dense vision tasks. view at source ↗
Figure 10
Figure 10. Weight Parameter Distribution. We compare weight distributions of GPT-2 models trained from scratch (blue) and with language-pretrained weights (orange) under correct labels (left) and 100% random labels (right). The pretrained model displays a smoother, more heavily tailed distribution, highlighting its ability to adapt to visual data—even with noisy, non-semantic supervision, thanks to latent linguistic … view at source ↗
Figure 11
Figure 11. Comparison of weight parameter distributions and outlier counts, for models trained on either random or correct labels, starting from either a pretrained initialization or from scratch. Even under random labels (top row), the pretrained model continues to exhibit stronger performance than its from-scratch counterpart despite producing more high-magnitude outlier parameters. With correct labels (bottom row… view at source ↗
Figure 12
Figure 12. Train and test acc. curves for pretrained vs. scratch GPT-2 on CIFAR-100. view at source ↗
Figure 13
Figure 13. More Partial Training ablation studies on layers. view at source ↗
Figure 14
Figure 14. Fine-grained parameter outlier and distribution comparison. view at source ↗
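Two procedures in the gallery are concrete enough to sketch: the Figure 4 loss-surface slice and the Figure 6 t-SNE of final-layer hidden states. The sketch below is illustrative only; `eval_loss` and `extract_final_hidden_state` are stand-ins for whatever evaluation batch and forward hook the actual code uses, and the per-layer normalization is one plausible reading of the caption.

```python
import numpy as np
import torch
from sklearn.manifold import TSNE


def loss_surface_slice(model, eval_loss, theta0, thetaT, alphas, betas):
    """L(theta0 + a*d1 + b*d2) with d1 = thetaT - theta0 and d2 random, per-layer normalized (Figure 4)."""
    d1 = [wt - w0 for w0, wt in zip(theta0, thetaT)]          # training direction
    d2 = [torch.randn_like(w) for w in theta0]                # random direction
    for u, v in zip(d2, d1):                                  # make d2 orthogonal to d1, match per-layer norm
        u -= (u * v).sum() / (v.norm() ** 2 + 1e-12) * v
        u *= v.norm() / (u.norm() + 1e-12)
    surface = torch.zeros(len(alphas), len(betas))
    with torch.no_grad():
        for i, a in enumerate(alphas):
            for j, b in enumerate(betas):
                for p, w0, v, u in zip(model.parameters(), theta0, d1, d2):
                    p.copy_(w0 + a * v + b * u)
                surface[i, j] = eval_loss(model)              # loss on a fixed evaluation batch
    return surface                                            # axes: (alpha, beta); value: loss


def tsne_of_hidden_states(model, samples, extract_final_hidden_state, perplexity=30):
    """Map final-layer hidden states h_i in R^d to R^2, as in Figure 6's visualization."""
    H = np.stack([extract_final_hidden_state(model, x) for x in samples])             # (n, d)
    return TSNE(n_components=2, perplexity=perplexity, init="pca").fit_transform(H)   # (n, 2)
```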
read the original abstract

The ratio of outlier parameters in language pre-training models and vision pre-training models differs significantly, making cross-modality (language and vision) inherently more challenging than cross-domain adaptation. As a result, many prior studies have focused on cross-domain transfer rather than attempting to bridge language and vision modalities, assuming that language pre-trained models are unsuitable for downstream visual tasks due to disparate parameter spaces. Contrary to this assumption, we show that adding a bridge training stage as a modality adaptation learner can effectively align Large Language Model (LLM) parameters with vision tasks. Specifically, we propose a simple yet powerful solution, random label bridge training, that requires no manual labeling and helps LLM parameters adapt to vision foundation tasks. Moreover, our findings reveal that partial bridge training is often advantageous, as certain layers in LLMs exhibit strong foundational properties that remain beneficial even without fine-tuning for visual tasks. This surprising discovery opens up new avenues for leveraging language pre-trained parameters directly within vision models and highlights the potential of partial bridge training as a practical pathway to cross-modality adaptation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that differences in outlier parameters between language and vision pre-training models make cross-modality adaptation difficult, but a simple 'random label bridge training' stage can align LLM parameters to vision tasks without manual labels. It further asserts that partial bridge training is often advantageous because certain LLM layers retain strong foundational properties beneficial for vision even without full fine-tuning.

Significance. If the empirical results and mechanism hold, the work would demonstrate a low-cost pathway to repurpose language-pretrained models for vision foundation tasks, reducing dependence on vision-specific pretraining and labeled data. The partial-training observation, if substantiated, could also inform more efficient adaptation strategies across modalities.

major comments (2)
  1. [Abstract and §3] Abstract and §3 (method): the claim that random-label bridge training aligns language and vision parameter spaces lacks any derivation, loss-landscape analysis, or bound showing how zero-semantic supervision produces task-relevant gradients; the optimization dynamics that would make this possible are not characterized.
  2. [§4.2] §4.2 (experiments on partial training): the assertion that certain layers exhibit 'strong foundational properties' for vision is supported only by performance after freezing; without layer-wise feature comparisons to vision-pretrained counterparts or controlled ablations that isolate the contribution of those layers, the claim that partial training is 'often advantageous' remains under-justified.
minor comments (2)
  1. [Abstract] The term 'outlier parameters' is used without a formal definition or citation to the specific statistic (e.g., kurtosis, tail index) employed to measure it.
  2. [Figures] Figure captions and axis labels should explicitly state the random-label generation procedure and the exact layers frozen in the partial-training regime.
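On minor comment 1: one plausible way to operationalize "outlier parameters" is a simple threshold count plus the excess kurtosis of the flattened weights. The threshold k and the choice of statistic here are our assumptions, not necessarily what the paper measures.

```python
import numpy as np
from scipy.stats import kurtosis

def outlier_stats(weights, k=6.0):
    """Summarize heavy-tailedness of a model's weights (illustrative definition)."""
    w = np.concatenate([np.asarray(t).ravel() for t in weights])  # flatten all weight tensors
    z = (w - w.mean()) / (w.std() + 1e-12)
    return {
        "num_outliers": int((np.abs(z) > k).sum()),    # weights more than k standard deviations out
        "outlier_ratio": float((np.abs(z) > k).mean()),
        "excess_kurtosis": float(kurtosis(w)),         # > 0 means heavier tails than a Gaussian
    }
```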

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments on our manuscript. Below, we provide point-by-point responses to the major comments and indicate the revisions made to address them.

read point-by-point responses
  1. Referee: [Abstract and §3] Abstract and §3 (method): the claim that random-label bridge training aligns language and vision parameter spaces lacks any derivation, loss-landscape analysis, or bound showing how zero-semantic supervision produces task-relevant gradients; the optimization dynamics that would make this possible are not characterized.

    Authors: We acknowledge that our work is primarily empirical and does not include a formal derivation or loss-landscape analysis of the alignment process. In the revised manuscript, we have added an intuitive explanation in §3 regarding how random labels can still generate task-relevant gradients by encouraging the learning of general visual representations. We have also included this as a limitation in the discussion, noting that a theoretical characterization remains an open question for future research. The empirical results across various benchmarks provide strong support for the practical utility of the method. revision: partial

  2. Referee: [§4.2] §4.2 (experiments on partial training): the assertion that certain layers exhibit 'strong foundational properties' for vision is supported only by performance after freezing; without layer-wise feature comparisons to vision-pretrained counterparts or controlled ablations that isolate the contribution of those layers, the claim that partial training is 'often advantageous' remains under-justified.

    Authors: We agree with the referee that the original evidence was limited. In the revised §4.2, we have added layer-wise feature comparisons using cosine similarity and CKA between activations from our partially trained models and vision-pretrained counterparts. Additionally, we performed controlled ablations by varying which layers are trained during the bridge stage and reported the corresponding performance gains. These results substantiate that certain layers retain strong foundational properties, making partial training often advantageous. revision: yes
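For reference, a minimal linear-CKA computation of the kind the rebuttal describes; this is the standard linear formulation, not necessarily the exact variant added in the revision.

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between activation matrices X (n x d1) and Y (n x d2) from the same n inputs."""
    X = X - X.mean(axis=0)          # center features
    Y = Y - Y.mean(axis=0)
    hsic = np.linalg.norm(Y.T @ X, "fro") ** 2
    return hsic / (np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro"))
```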

Circularity Check

0 steps flagged

No circularity: empirical proposal with no self-referential derivations or fitted predictions

full rationale

The paper proposes random-label bridge training as a modality adaptation stage and reports empirical findings that partial training of certain LLM layers preserves useful properties for vision tasks. These are presented as experimental outcomes rather than derived from equations or prior self-citations that reduce the result to its own inputs by construction. No load-bearing self-citations, uniqueness theorems, ansatzes, or renamings of known results appear in the abstract or description. The central claim rests on the observable effects of the training procedure, which remains externally falsifiable.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the unverified effectiveness of random-label bridge training and the domain assumption about differing outlier-parameter ratios. No free parameters or invented physical entities are introduced; the new element is the training procedure itself.

axioms (1)
  • domain assumption: The ratio of outlier parameters differs significantly between language pre-training and vision pre-training models, making cross-modality transfer inherently harder than cross-domain adaptation.
    Stated at the opening of the abstract as the reason prior work avoided language-to-vision transfer.
invented entities (1)
  • random label bridge training: no independent evidence
    purpose: Modality adaptation learner that aligns LLM parameters with vision tasks using no manual labels
    Introduced as the core practical solution; no independent evidence outside the paper is supplied in the abstract.

pith-pipeline@v0.9.0 · 5481 in / 1407 out tokens · 50295 ms · 2026-05-13T21:03:54.118278+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

74 extracted references · 74 canonical work pages · 6 internal anchors

  1. [1]

    GPT-4 Technical Report

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023

  2. [3]

    Flamingo: a visual language model for few-shot learning.Advances in neural information processing systems, 35:23716–23736, 2022

    Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning.Advances in neural information processing systems, 35:23716–23736, 2022

  3. [4]

    Intrinsic dimension of data representations in deep neural networks.Advances in Neural Information Processing Systems, 32, 2019

    Alessio Ansuini, Alessandro Laio, Jakob H Macke, and Davide Zoccolan. Intrinsic dimension of data representations in deep neural networks.Advances in Neural Information Processing Systems, 32, 2019

  4. [5]

    Analysis of representations for domain adaptation.Advances in neural information processing systems, 19, 2006

    Shai Ben-David, John Blitzer, Koby Crammer, and Fernando Pereira. Analysis of representations for domain adaptation.Advances in neural information processing systems, 19, 2006

  5. [6]

    A theory of learning from different domains

    Shai Ben-David, John Blitzer, Koby Crammer, Alex Kulesza, Fernando Pereira, and Jennifer Wortman Vaughan. A theory of learning from different domains. InMachine Learning, volume 79, pages 151–175,

  6. [7]

    doi: 10.1007/s10994-009-5152-4

  7. [8]

    End-to-end object detection with transformers

    Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. InEuropean conference on computer vision, pages 213–229. Springer, 2020

  8. [9]

    A simple framework for contrastive learning of visual representations

    Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. InInternational conference on machine learning, pages 1597–1607. PMLR, 2020

  9. [10]

    Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks

    Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 24185–24198, 2024

  10. [11]

    Domain adversarial transfer network for cross-domain fault diagnosis of rotary machinery.IEEE Transactions on Instrumentation and Measurement, 69(11):8702–8712, 2020

    Zhuyun Chen, Guolin He, Jipu Li, Yixiao Liao, Konstantinos Gryllias, and Weihua Li. Domain adversarial transfer network for cross-domain fault diagnosis of rotary machinery.IEEE Transactions on Instrumentation and Measurement, 69(11):8702–8712, 2020

  11. [12]

    Imagenet: A large-scale hierarchical image database

    Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255. IEEE, 2009

  12. [13]

    Pmr: Prototypical modal rebalance for multimodal learning

    Yunfeng Fan, Wenchao Xu, Haozhao Wang, Junxiao Wang, and Song Guo. Pmr: Prototypical modal rebalance for multimodal learning. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20029–20038, 2023

  13. [14]

    A brief review of domain adaptation.Advances in data science and information engineering: proceedings from ICDATA 2020 and IKE 2020, pages 877–894, 2021

    Abolfazl Farahani, Sahar Voghoei, Khaled Rasheed, and Hamid R Arabnia. A brief review of domain adaptation.Advances in data science and information engineering: proceedings from ICDATA 2020 and IKE 2020, pages 877–894, 2021

  14. [15]

    An investigation into neural net optimization via hessian eigenvalue density

    Behrooz Ghorbani, Shankar Krishnan, and Ying Xiao. An investigation into neural net optimization via hessian eigenvalue density. InInternational Conference on Machine Learning, pages 2232–2241. PMLR, 2019

  15. [16]

    Bootstrap your own latent-a new approach to self-supervised learning.Advances in neural information processing systems, 33:21271–21284, 2020

    Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Guo, Mohammad Gheshlaghi Azar, et al. Bootstrap your own latent-a new approach to self-supervised learning.Advances in neural information processing systems, 33:21271–21284, 2020

  16. [17]

    Larry C. Grove. Classical groups and geometric algebra, graduate studies in mathematics.American Mathematical Society, (ISBN 978-0-8218-2019-3, MR1859189), 2002

  17. [18]

    Rethinking imagenet pre-training

    Kaiming He, Ross Girshick, and Piotr Dollár. Rethinking imagenet pre-training. InProceedings of the IEEE/CVF international conference on computer vision, pages 4918–4927, 2019

  18. [19]

    Momentum contrast for unsupervised visual representation learning

    Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9729–9738, 2020

  19. [20]

    Deep networks with stochastic depth

    Gao Huang, Yu Sun, Zhuang Liu, Daniel Sedra, and Kilian Q Weinberger. Deep networks with stochastic depth. InEuropean Conference on Computer Vision, pages 646–661. Springer, 2016

  20. [21]

    The Platonic Representation Hypothesis

    Minyoung Huh, Brian Cheung, Tongzhou Wang, and Phillip Isola. The platonic representation hypothesis. arXiv preprint arXiv:2405.07987, 2024

  21. [22]

    Cross-domain weakly- supervised object detection through progressive domain adaptation

    Naoto Inoue, Ryosuke Furuta, Toshihiko Yamasaki, and Kiyoharu Aizawa. Cross-domain weakly- supervised object detection through progressive domain adaptation. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 5001–5009, 2018

  22. [23]

    Normal multivariate analysis and the orthogonal group.The Annals of Mathematical Statistics, 25(1):40–75, 1954

    Alan T James. Normal multivariate analysis and the orthogonal group.The Annals of Mathematical Statistics, 25(1):40–75, 1954

  23. [24]

    Improving multimodal learning with multi-loss gradient modulation.arXiv preprint arXiv:2405.07930, 2024

    John Kontras, Emma Lee, and Amandeep Singh. Improving multimodal learning with multi-loss gradient modulation.arXiv preprint arXiv:2405.07930, 2024

  24. [25]

    Learning multiple layers of features from tiny images

    Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, University of Toronto, Toronto, Canada, 2009

  25. [26]

    Tiny imagenet visual recognition challenge

    Ya Le and Xuan Yang. Tiny imagenet visual recognition challenge. InCS 231N, 2015

  26. [27]

    LLaVA-OneVision: Easy Visual Task Transfer

    Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, et al. Llava-onevision: Easy visual task transfer.arXiv preprint arXiv:2408.03326, 2024

  27. [28]

    Boosting multi-modal model performance with adaptive gradient modulation

    Hong Li, Xingyu Li, Pengbo Hu, Yinuo Lei, Chunxiao Li, and Yi Zhou. Boosting multi-modal model performance with adaptive gradient modulation. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 22214–22224, 2023

  28. [29]

    Do vision and language models share concepts? a vector space alignment study.Transactions of the Association for Computational Linguistics, 2024

    Jiaang Li, Yova Kementchedjhieva, Constanza Fierro, and Anders Søgaard. Do vision and language models share concepts? a vector space alignment study.Transactions of the Association for Computational Linguistics, 2024

  29. [30]

    Vila: On pre-training for visual language models

    Ji Lin, Hongxu Yin, Wei Ping, Pavlo Molchanov, Mohammad Shoeybi, and Song Han. Vila: On pre-training for visual language models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26689–26699, 2024

  30. [31]

    Microsoft COCO: Common Objects in Context

    Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context.arXiv preprint arXiv:1405.0312, 2014

  31. [33]

    Visual instruction tuning.Advances in neural information processing systems, 36, 2024

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning.Advances in neural information processing systems, 36, 2024

  32. [34]

    Visual perception by large language model’s weights

    Feipeng Ma, Hongwei Xue, Guangting Wang, Yizhou Zhou, Fengyun Rao, Shilin Yan, Yueyi Zhang, Siying Wu, Mike Zheng Shou, and Xiaoyan Sun. Visual perception by large language model’s weights. arXiv preprint arXiv:2405.20339, 2024

  33. [35]

    What do neural networks learn when trained with random labels? Advances in Neural Information Processing Systems, 33:19693–19704, 2020

    Hartmut Maennel, Ibrahim M Alabdulmohsin, Ilya O Tolstikhin, Robert Baldock, Olivier Bousquet, Sylvain Gelly, and Daniel Keysers. What do neural networks learn when trained with random labels? Advances in Neural Information Processing Systems, 33:19693–19704, 2020

  34. [36]

    Do vision and language encoders represent the world similarly? InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

    Mayug Maniparambil, Raiymbek Akshulakov, Yasser Abdelaziz Dahou Djilali, Mohamed El Amine Seddik, Sanath Narayan, Karttikeya Mangalam, and Noel E O’Connor. Do vision and language encoders represent the world similarly? InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

  35. [37]

    Implicit self-regularization in deep neural networks: Evidence from random matrix theory and implications for learning.Journal of Machine Learning Research, 22 (165):1–73, 2021

    Charles H Martin and Michael W Mahoney. Implicit self-regularization in deep neural networks: Evidence from random matrix theory and implications for learning.Journal of Machine Learning Research, 22 (165):1–73, 2021

  36. [38]

    The role of context for object detection and semantic segmentation in the wild

    Roozbeh Mottaghi, Xianjie Chen, Xiaobai Liu, Nam-Gyu Cho, Seong-Whan Lee, Sanja Fidler, Raquel Urtasun, and Alan Yuille. The role of context for object detection and semantic segmentation in the wild. InIEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014

  37. [39]

    Cross-domain transferability of adversarial perturbations.Advances in Neural Information Processing Systems, 32, 2019

    Muhammad Muzammal Naseer, Salman H Khan, Muhammad Haris Khan, Fahad Shahbaz Khan, and Fatih Porikli. Cross-domain transferability of adversarial perturbations.Advances in Neural Information Processing Systems, 32, 2019

  38. [40]

    What is being transferred in transfer learning? Advances in neural information processing systems, 33:512–523, 2020

    Behnam Neyshabur, Hanie Sedghi, and Chiyuan Zhang. What is being transferred in transfer learning? Advances in neural information processing systems, 33:512–523, 2020

  39. [41]

    Cross-domain sentiment classification via spectral feature alignment

    Sinno Jialin Pan, Xiaochuan Ni, Jian-Tao Sun, Qiang Yang, and Zheng Chen. Cross-domain sentiment classification via spectral feature alignment. InProceedings of the 19th international conference on World wide web, pages 751–760, 2010

  40. [42]

    Balanced multimodal learning via on-the-fly gradient modulation

    Xiaokang Peng, Yake Wei, Andong Deng, Dong Wang, and Di Hu. Balanced multimodal learning via on-the-fly gradient modulation. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8238–8247, 2022

  41. [43]

    Language models are unsupervised multitask learners.OpenAI Blog, 1(8):9, 2019

    Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners.OpenAI Blog, 1(8):9, 2019

  42. [44]

    Language models are unsupervised multitask learners.OpenAI blog, 1(8):9, 2019

    Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners.OpenAI blog, 1(8):9, 2019

  43. [45]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. InInternational conference on machine learning, pages 8748–8763. PMLR, 2021

  44. [46]

    Dsod: Learning deeply supervised object detectors from scratch

    Zhiqiang Shen, Zhuang Liu, Jianguo Li, Yu-Gang Jiang, Yurong Chen, and Xiangyang Xue. Dsod: Learning deeply supervised object detectors from scratch. InProceedings of the IEEE international conference on computer vision, pages 1919–1927, 2017

  45. [47]

    Does representation matter? exploring intermediate layers in large language models.arXiv preprint arXiv:2412.09563, 2024

    Oscar Skean, Md Rifat Arefin, Yann LeCun, and Ravid Shwartz-Ziv. Does representation matter? exploring intermediate layers in large language models.arXiv preprint arXiv:2412.09563, 2024

  46. [48]

    Segmenter: Transformer for semantic segmentation

    Robin Strudel, Ricardo Garcia, Ivan Laptev, and Cordelia Schmid. Segmenter: Transformer for semantic segmentation. InProceedings of the IEEE/CVF international conference on computer vision, pages 7262–7272, 2021

  47. [49]

    Test-time training with self-supervision for generalization under distribution shifts

    Yu Sun, Xiaolong Wang, Zhuang Liu, John Miller, Alexei Efros, and Moritz Hardt. Test-time training with self-supervision for generalization under distribution shifts. InInternational conference on machine learning, pages 9229–9248. PMLR, 2020

  48. [50]

    Gemini: A Family of Highly Capable Multimodal Models

    Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: a family of highly capable multimodal models.arXiv preprint arXiv:2312.11805, 2023

  49. [51]

    LLaMA: Open and Efficient Foundation Language Models

    Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models.arXiv preprint arXiv:2302.13971, 2023

  50. [52]

    Tent: Fully test-time adaptation by entropy minimization

    Dequan Wang, Evan Shelhamer, Shaoteng Liu, Bruno Olshausen, and Trevor Darrell. Tent: Fully test-time adaptation by entropy minimization. InInternational Conference on Learning Representations, 2021

  51. [53]

    tent: fully test-time adaptation by entropy minimization.ICLR, 2021

    Dequan Wang, Evan Shelhamer, Shaoteng Liu, Bruno Olshausen, and Trevor Darrell. tent: fully test-time adaptation by entropy minimization.ICLR, 2021

  52. [54]

    Cross-domain contrastive learning for unsupervised domain adaptation.IEEE Transactions on Multimedia, 25:1665– 1673, 2022

    Rui Wang, Zuxuan Wu, Zejia Weng, Jingjing Chen, Guo-Jun Qi, and Yu-Gang Jiang. Cross-domain contrastive learning for unsupervised domain adaptation.IEEE Transactions on Multimedia, 25:1665– 1673, 2022

  53. [55]

    Mmpareto: Boosting multimodal learning with innocent unimodal assistance.arXiv preprint arXiv:2405.17730, 2024

    Xiaofeng Wei and Zhiyuan Hu. Mmpareto: Boosting multimodal learning with innocent unimodal assistance.arXiv preprint arXiv:2405.17730, 2024

  54. [56]

    Cdtrans: Cross-domain transformer for unsupervised domain adaptation.arXiv preprint arXiv:2109.06165, 2021

    Tongkun Xu, Weihua Chen, Pichao Wang, Fan Wang, Hao Li, and Rong Jin. Cdtrans: Cross-domain transformer for unsupervised domain adaptation.arXiv preprint arXiv:2109.06165, 2021

  55. [57]

    An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, et al. Qwen2.5 technical report. arXiv preprint arXiv:2412.15115, 2024

  56. [58]

    Learning modality-specific representations with self-supervised multi-task learning for multimodal sentiment analysis.arXiv preprint arXiv:2102.04830, 2021

    Linjie Yu, Kun He, and Wenjie Zhang. Learning modality-specific representations with self-supervised multi-task learning for multimodal sentiment analysis.arXiv preprint arXiv:2102.04830, 2021

  57. [59]

    Multimodal representation learning by alternating unimodal adaptation

    Xiaohui Zhang, Jaehong Yoon, Mohit Bansal, and Huaxiu Yao. Multimodal representation learning by alternating unimodal adaptation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 27456–27466, 2024

  58. [60]

    Scene parsing through ade20k dataset

    Bolei Zhou, Hang Zhao, Xavier Puig, Sanja Fidler, Adela Barriuso, and Antonio Torralba. Scene parsing through ade20k dataset. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 633–641, 2017

  59. [61]

    Adapting object detectors via selective cross-domain alignment

    Xinge Zhu, Jiangmiao Pang, Ceyuan Yang, Jianping Shi, and Dahua Lin. Adapting object detectors via selective cross-domain alignment. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 687–696, 2019

  60. [62]

    Rethinking pre-training and self-training.Advances in neural information processing systems, 33: 3833–3845, 2020

    Barret Zoph, Golnaz Ghiasi, Tsung-Yi Lin, Yin Cui, Hanxiao Liu, Ekin Dogus Cubuk, and Quoc Le. Rethinking pre-training and self-training.Advances in neural information processing systems, 33: 3833–3845, 2020

  61. [63]

    Unis-mmc: Multimodal classification via unimodality-supervised multimodal contrastive learning.arXiv preprint arXiv:2305.09299, 2023

    Heqing Zou, Meng Shen, Chen Chen, Yuchen Hu, Deepu Rajan, and Eng Siong Chng. Unis-mmc: Multimodal classification via unimodality-supervised multimodal contrastive learning.arXiv preprint arXiv:2305.09299, 2023

  (The remaining anchors are excerpts from the paper's appendix, Section A: Detailed Proofs, extracted as internal anchors rather than external references.)

  62. [64]

    Input Assumptions. Each input x ∈ R^d is drawn i.i.d. from a distribution with zero mean and covariance Σ_x ∈ R^{d×d}. For concreteness and simplicity, x can be taken to be Gaussian (i.e., x ∼ N(0, Σ_x)), or more generally, drawn from a distribution D whose symmetry group is large enough to include all orthogonal transformations that preserve Σ_x.

  63. [65]

    Network Assumptions. The first layer of the neural network is either a fully connected embedding or a (patch-based) convolution for image input. In either case, each first-layer weight vector w ∈ R^d interacts with x primarily through inner products ⟨w, x⟩. The initial weights w ∈ R^d in the first layer are drawn i.i.d. from an isotropic distribution with mean zero an...

  64. [66]

    Label Assumptions (Random Labels). Each training instance (x, y) has label y drawn independently and uniformly from a finite label set {1, 2, …, c}, regardless of x. This implies there is no genuine correlation between input x and label y.

  65. [67]

    Training Assumption. We train the first-layer parameters (and possibly deeper layers) by stochastic gradient descent (SGD) for T steps. The crucial point: the labels being random, the only structure the model sees is in x. Proof. We adopt an invariance argument using the orthogonal group [22, 16, 34]: G = { G ∈ O(d) | Gᵀ Σ_x G = Σ_x }, (7) where the set of all orthogonal matri...

  66. [68]

    Step 1 (Invariance in Distribution). We show that, at each iteration t, the distribution of w remains invariant under the action of G. That is, if w ∼ μ_t is the distribution of weights at iteration t, then Gw ∼ μ_t for all G ∈ G.

  67. [69]

    Step 2 (Alignment of Covariances). Because the data distribution and the random labels provide no directional bias other than Σ_x itself, the limiting covariance Σ_w of w must share the same eigenspaces as Σ_x. Concretely, any distribution μ that is invariant under all G ∈ G must be aligned with Σ_x. One then shows that each eigenvalue σ²_i of Σ_x maps to a corresponding...

  68. [70]

    Definition of Group Action on w and x. Let w ∈ R^d and x ∈ R^d. For each G ∈ G ⊆ O(d), define (G·w, G·x) = (Gw, Gx). (8) Since G ∈ O(d) preserves inner products, ⟨Gw, Gx⟩ = ⟨w, x⟩.

  69. [71]

    Invariance of Data Distribution. Because G ∈ G satisfies Gᵀ Σ_x G = Σ_x, it leaves the distribution of x invariant. Concretely, x ∼ N(0, Σ_x) implies Gx ∼ N(0, Σ_x) as well. Thus sampling x and then applying G yields a sample from the same distribution.

  70. [72]

    Initial Weights Are Isotropic. By assumption, the initial weight distribution w_0 is isotropic: E[w_0 w_0ᵀ] = σ²I. Hence Gw_0 ∼ w_0. This implies that at t = 0, the distribution μ_0 of w_0 is invariant under G.

  71. [73]

    Loss Function Under Random Labels. The training loss at iteration t is L(w_t; x, y) = ℓ(⟨w_t, x⟩, y). (9) Since y is random and uncorrelated with x, each gradient update w_{t+1} = w_t − η∇_w L(w_t; x, y) depends only on the scalar product ⟨w_t, x⟩. Crucially, if w_t and x follow distributions that are invariant under group action by G, the next update remains invariant as well.

  72. [74]

    SGD Update Commutes with Group Action. Formally, one must show that for each G ∈ G, Gw_{t+1} = G(w_t − η∇_w L(w_t; x, y)) = (Gw_t) − η∇_{Gw_t} L(Gw_t; Gx, y), (10) where the last equality holds because ⟨Gw_t, Gx⟩ = ⟨w_t, x⟩ and the label y is unchanged by G. By induction, this invariance holds at every iteration t. Thus the distribution μ_t of w_t satisfies Gw_t ∼ w_t for all G ∈ G. Step 2: C...

  73. [75]

    Hence Σ_w Shares Eigenspaces with Σ_x. Since each V_i is forced to be an invariant subspace of Σ_w, we conclude that Σ_w and Σ_x must diagonalize in the same basis. In other words, there exist scalars τ²_i such that Σ_w v = τ²_i v for all v ∈ V_i. This shows that Σ_w and Σ_x share eigenvectors but differ in their eigenvalues τ²_i vs. σ²_i.

  74. [76]

    Eigenvalue Mapping σ²_i ↦ τ²_i. Empirically, one observes a smooth function f(·) such that τ_i ≈ f(σ_i). Intuitively, directions in x-space with larger variance σ²_i yield (via random-label SGD) more significant updates to w, driving the corresponding τ²_i higher until other competing directions partially balance out. This effect is captured by a stable “transfer fu...
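A minimal numerical check of the argument sketched in anchors [64]–[76], under the stated assumptions (Gaussian inputs, isotropic initialization, labels independent of x). The two-layer model, step counts, and dimensions below are illustrative choices, not the paper's setup; values near 1 in the printed diagonal would be consistent with the claimed eigenspace alignment.

```python
import numpy as np

rng = np.random.default_rng(0)
d, c, m, batch, steps, lr = 16, 10, 256, 64, 3000, 0.05  # input dim, classes, hidden units

# Anisotropic input covariance Sigma_x with a known eigenbasis Q.
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))
evals = np.linspace(4.0, 0.1, d)                 # descending eigenvalues
Sigma_x = Q @ np.diag(evals) @ Q.T
Lx = np.linalg.cholesky(Sigma_x)

W = rng.normal(scale=0.1, size=(m, d))           # first layer, isotropic init
V = rng.normal(scale=0.1, size=(c, m))           # classifier head

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

for _ in range(steps):
    x = (Lx @ rng.normal(size=(d, batch))).T     # Gaussian inputs with covariance Sigma_x
    y = rng.integers(0, c, size=batch)           # random labels, independent of x
    h = np.tanh(x @ W.T)                         # (batch, m)
    p = softmax(h @ V.T)                         # (batch, c)
    p[np.arange(batch), y] -= 1.0                # d(cross-entropy)/d(logits)
    g_h = (p @ V) * (1.0 - h ** 2)               # backprop through tanh
    V -= lr * p.T @ h / batch
    W -= lr * g_h.T @ x / batch

# The invariance argument predicts the learned weight covariance shares
# eigenvectors with Sigma_x; near-1 diagonal overlaps would support that.
Sigma_w = W.T @ W / m
U = np.linalg.eigh(Sigma_w)[1][:, ::-1]          # eigenvectors, descending eigenvalue order
print(np.round(np.abs(np.diag(Q.T @ U)), 2))
```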