pith. machine review for the scientific record. sign in

arxiv: 2604.12391 · v1 · submitted 2026-04-14 · 💻 cs.CV · cs.AI

Recognition: 2 theorem links

· Lean Theorem

Chain-of-Models Pre-Training: Rethinking Training Acceleration of Vision Foundation Models

Anbang Yao, Chao Li, Jiawei Fan, Shigeng Wang, Xiaolong Liu

Authors on Pith no claims yet

Pith reviewed 2026-05-10 16:12 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords chain-of-models pre-trainingvision foundation modelstraining accelerationknowledge transfermodel family scalingpre-training efficiency
0
0 comments X

The pith

Pre-training vision foundation model families from smallest to largest reuses knowledge to match or exceed individual training performance at far lower total cost.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes Chain-of-Models Pre-Training to accelerate training for entire families of vision foundation models instead of handling each model in isolation. Only the smallest model receives full individual pre-training; each subsequent larger model is trained by sequentially transferring knowledge backward from its smaller predecessors, reusing information jointly in the parameter space and the feature space. This produces performance that is mostly superior to standard separate training while cutting overall compute, with the efficiency advantage growing as the family includes more models. The claim is tested across 45 datasets on both zero-shot and fine-tuning tasks and shown to be independent of the underlying pre-training method.

Core claim

CoM-PT sets up an ascending-size model chain in which only the smallest model undergoes standard pre-training while the others are trained via sequential inverse knowledge transfer that reuses knowledge from smaller predecessors in both parameter and feature spaces, yielding mostly superior performance at substantially lower training cost and higher efficiency as the number of models in the family increases.

What carries the argument

The ascending model chain that performs sequential inverse knowledge transfer by jointly reusing knowledge from smaller predecessors in parameter space and feature space.

If this is right

  • All models in the chain mostly outperform models trained individually on zero-shot and fine-tuning tasks across 45 datasets.
  • Computational cost drops sharply; for example, prepending smaller models up to ViT-L reduces complexity by as much as 72 percent.
  • Acceleration ratios rise with family size, moving from 4.13X for three models to 7.09X for seven models.
  • The method applies regardless of the specific pre-training paradigm.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same ascending-chain pattern could be tested on language-model families to measure whether efficiency gains scale similarly.
  • Model development practices might shift toward designing compatible size families rather than isolated large models to capture cumulative savings.
  • The transfer mechanism might combine with other acceleration methods for still larger gains in compute-intensive regimes.

Load-bearing premise

That knowledge transferred from smaller to larger models can be reused in a way that keeps or improves performance without adding overhead that cancels the overall savings.

What would settle it

A controlled run on a new model family in which the total compute for the CoM-PT chain exceeds the sum of individual trainings or in which downstream accuracy falls below the individually trained baselines.

Figures

Figures reproduced from arXiv: 2604.12391 by Anbang Yao, Chao Li, Jiawei Fan, Shigeng Wang, Xiaolong Liu.

Figure 1
Figure 1. Figure 1: Standard individual pre-training vs. Chain-of-Models Pre-Training. All ViTs [17] are trained on a combined dataset of CC3M [59] and CC12M [4], and evaluated on ImageNet-1K [14]. As a representative example, the emergence of contrastive language-image pre-training (CLIP) [52] enables vision transformers (ViTs) to achieve state-of-the-art performance in zero-shot tasks and solidifies them as the de facto vi￾… view at source ↗
Figure 2
Figure 2. Figure 2: Overview of Chain-of-Models Pre-training pipeline with inverse knowledge transfer relay. Models are organized sequentially by ascending model size, from the smallest model m1 to the largest mn, forming a model chain. m1 first undergoes standard individual pre-training. Each subsequent mi is then efficiently trained via inverse knowledge transfer (denoted as ) from its immediate predecessor mi−1, forming a … view at source ↗
Figure 3
Figure 3. Figure 3: Inverse knowledge transfer from mi to mi+1. block-block width differences, we directly insert the param￾eters of the small teacher into the large student, leaving the remaining parameters randomly initialized; ii) for layer-layer depth differences, we duplicate weights of each layer and in￾dex it as the succeeding layer. See Table H of Supplementary Materials for comparisons against sophisticated designs. … view at source ↗
Figure 4
Figure 4. Figure 4: Convergence analysis across different training epochs. We train ViT-T/16, ViT-S/16, and ViT-B/16 on CC3M for 32, 64, 128, and 256 epochs to determine convergence points. Top-1 (%) represents zero-shot classification accuracy on ImageNet-1K. Evaluation Protocols. We evaluate model performance across 45 datasets spanning zero-shot tasks and fine-tuning tasks. i) Zero-shot performance is assessed on ImageNet-… view at source ↗
Figure 5
Figure 5. Figure 5 [PITH_FULL_IMAGE:figures/full_fig_p006_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Reduced training MACs vs. model chain starting with smaller models. The reduced MACs represent the total training cost of the model chain relative to the cost of individually pre￾training ViT-L. All models are performance-lossless. Vision-Language Tasks. Vision encoders pre-trained with CoM-PT achieve highly competitive performance against baselines across four datasets. It outperforms baseline in text und… view at source ↗
Figure 9
Figure 9. Figure 9: Acceleration ratio variation under different training epochs and data scales. Experiments use the ViT family. (a) On CC3M, we vary the training epochs (32, 64, and 128). (b) We test the data scale from 22.0M to 206.8M under a fixed 64-epoch setting to isolate data scale effects. The curve fitting format follows [77]. 5.4.2. Pre-training Acceleration under Different Setups Acceleration under Various Trainin… view at source ↗
read the original abstract

In this paper, we present Chain-of-Models Pre-Training (CoM-PT), a novel performance-lossless training acceleration method for vision foundation models (VFMs). This approach fundamentally differs from existing acceleration methods in its core motivation: rather than optimizing each model individually, CoM-PT is designed to accelerate the training pipeline at the model family level, scaling efficiently as the model family expands. Specifically, CoM-PT establishes a pre-training sequence for the model family, arranged in ascending order of model size, called model chain. In this chain, only the smallest model undergoes standard individual pre-training, while the other models are efficiently trained through sequential inverse knowledge transfer from their smaller predecessors by jointly reusing the knowledge in the parameter space and the feature space. As a result, CoM-PT enables all models to achieve performance that is mostly superior to standard individual training while significantly reducing training cost, and this is extensively validated across 45 datasets spanning zero-shot and fine-tuning tasks. Notably, its efficient scaling property yields a remarkable phenomenon: training more models even results in higher efficiency. For instance, when pre-training on CC3M: i) given ViT-L as the largest model, progressively prepending smaller models to the model chain reduces computational complexity by up to 72%; ii) within a fixed model size range, as the VFM family scales across 3, 4, and 7 models, the acceleration ratio of CoM-PT exhibits a striking leap: from 4.13X to 5.68X and 7.09X. Since CoM-PT is naturally agnostic to specific pre-training paradigms, we open-source the code to spur further extensions in more computationally intensive scenarios, such as large language model pre-training.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The paper introduces Chain-of-Models Pre-Training (CoM-PT), a training acceleration method for vision foundation models that arranges models into an ascending-size chain. Only the smallest model receives standard individual pre-training; all larger models are trained via sequential inverse knowledge transfer that jointly reuses knowledge from predecessors in both parameter space and feature space. The central claims are that this yields performance that is mostly superior (or at least non-inferior) to independent training, delivers substantial compute savings (e.g., up to 72% reduction for ViT-L on CC3M), and exhibits an efficient scaling property in which adding more models to the family increases the acceleration ratio (4.13X to 7.09X). These results are reported across 45 datasets covering zero-shot and fine-tuning tasks, and the method is presented as agnostic to specific pre-training paradigms.

Significance. If the empirical results hold, the work is significant because it shifts the optimization target from individual models to entire model families and demonstrates that sequential inverse transfer can preserve or improve performance while reducing total compute. The counter-intuitive scaling observation—that larger families become more efficient—is noteworthy and, if reproducible, could influence how vision foundation model suites are trained. Credit is due for the extensive validation on 45 datasets and for open-sourcing the code, both of which support reproducibility and extension to other domains such as language-model pre-training.

major comments (2)
  1. [§4.3] §4.3 and Table 4: the claim that performance is 'mostly superior' across all 45 datasets requires a clearer breakdown (number of datasets showing statistically significant gains, parity, or degradation) and per-model-size results; without this granularity the 'mostly' qualifier remains difficult to evaluate against the central performance claim.
  2. [§3.2] §3.2, Eq. (3)–(5): the joint parameter- and feature-space transfer mechanism is load-bearing for both the performance and efficiency claims; the manuscript should explicitly report the additional FLOPs or memory cost of the feature-space reuse step itself so that readers can verify that the reported net savings (e.g., 72%) are not offset by transfer overhead.
minor comments (3)
  1. The abstract states 'performance-lossless' while the body uses 'mostly superior'; harmonize the terminology and define the precise acceptance criterion for 'superior'.
  2. [Figure 3] Figure 3 (scaling curves) and the associated text would benefit from error bars or multiple random seeds to substantiate the efficiency-leap claim when the family grows from 3 to 7 models.
  3. The open-sourced code link is appreciated; please ensure the released repository includes the exact hyper-parameters and data splits used for the 45-dataset evaluation.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive review and the recommendation for minor revision. The comments identify opportunities to strengthen the clarity of our performance claims and the transparency of our efficiency analysis. We address each point below and will revise the manuscript accordingly.

read point-by-point responses
  1. Referee: [§4.3] §4.3 and Table 4: the claim that performance is 'mostly superior' across all 45 datasets requires a clearer breakdown (number of datasets showing statistically significant gains, parity, or degradation) and per-model-size results; without this granularity the 'mostly' qualifier remains difficult to evaluate against the central performance claim.

    Authors: We agree that a granular breakdown is necessary to substantiate the 'mostly superior' claim. In the revised manuscript we will augment §4.3 with a new table (or expanded Table 4) that reports, for each model size in the chain: (i) the number of datasets showing statistically significant gains (using paired t-tests at p<0.05), (ii) the number showing parity within a small tolerance, and (iii) any cases of degradation. We will also include per-model-size average metrics across the 45 datasets to allow direct comparison of improvement magnitude at each scale. revision: yes

  2. Referee: [§3.2] §3.2, Eq. (3)–(5): the joint parameter- and feature-space transfer mechanism is load-bearing for both the performance and efficiency claims; the manuscript should explicitly report the additional FLOPs or memory cost of the feature-space reuse step itself so that readers can verify that the reported net savings (e.g., 72%) are not offset by transfer overhead.

    Authors: We appreciate this request for explicit accounting. The feature-space reuse component in Eq. (3)–(5) adds a modest overhead from the additional feature extraction and alignment computations during each transfer step. In the revision we will insert a dedicated paragraph in §3.2 that quantifies the extra FLOPs and peak memory for this step across the model sizes used in our experiments. We will then recompute and report the net training-cost reduction (including this overhead) for the CC3M ViT-L case and the scaling experiments, confirming that the headline savings figures remain valid after subtraction. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper introduces CoM-PT as a new pre-training method that chains models by size and reuses knowledge from smaller to larger ones via parameter and feature space transfer. The central claims rest on empirical results across 45 datasets and scaling experiments rather than any self-referential definition, fitted input renamed as prediction, or load-bearing self-citation. No equations are presented that reduce the claimed acceleration or performance gains to the method's own inputs by construction, and the approach is described as agnostic to specific paradigms with open-sourced code for independent verification.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract does not specify any free parameters, axioms, or invented entities. The method builds on existing knowledge transfer concepts but introduces a new chaining strategy.

pith-pipeline@v0.9.0 · 5635 in / 1332 out tokens · 54668 ms · 2026-05-10T16:12:46.021325+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Reference graph

Works this paper leans on

85 extracted references · 17 canonical work pages · 10 internal anchors

  1. [1]

    GPT-4 Technical Report

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report.arXiv preprint arXiv:2303.08774, 2023. 13

  2. [2]

    Qwen Technical Report

    Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xi- aodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. Qwen technical report.arXiv preprint arXiv:2309.16609,

  3. [3]

    Coco- stuff: Thing and stuff classes in context

    Holger Caesar, Jasper Uijlings, and Vittorio Ferrari. Coco- stuff: Thing and stuff classes in context. InCVPR, 2018. 7, 12

  4. [4]

    Conceptual 12M: Pushing web-scale image-text pre- training to recognize long-tail visual concepts

    Soravit Changpinyo, Piyush Sharma, Nan Ding, and Radu Soricut. Conceptual 12M: Pushing web-scale image-text pre- training to recognize long-tail visual concepts. InCVPR,

  5. [5]

    bert2bert: Towards reusable pretrained language models

    Cheng Chen, Yichun Yin, Lifeng Shang, Xin Jiang, Yujia Qin, Fengyu Wang, Zhi Wang, Xiao Chen, Zhiyuan Liu, and Qun Liu. bert2bert: Towards reusable pretrained language models. InACL, 2022. 3

  6. [6]

    Cross-layer distillation with semantic calibration

    Defang Chen, Jian-Ping Mei, Yuan Zhang, Can Wang, Zhe Wang, Yan Feng, and Chun Chen. Cross-layer distillation with semantic calibration. InAAAI, 2021. 3

  7. [7]

    ShareGPT4V: Improving Large Multi-Modal Models with Better Captions

    Lin Chen, Jisong Li, Xiaoyi Dong, Pan Zhang, Conghui He, Jiaqi Wang, Feng Zhao, and Dahua Lin. Sharegpt4v: Im- proving large multi-modal models with better captions.arXiv preprint arXiv:2311.12793, 2023. 5

  8. [8]

    Distilling knowledge via knowledge review

    Pengguang Chen, Shu Liu, Hengshuang Zhao, and Jiaya Jia. Distilling knowledge via knowledge review. InCVPR, 2021. 3

  9. [9]

    Net2net: Accelerating learning via knowl- edge transfer

    Tianqi Chen, Ian Goodfellow, and Jonathon Shlens. Net2net: Accelerating learning via knowledge transfer.arXiv preprint arXiv:1511.05641, 2015. 3, 15

  10. [10]

    Improved feature distillation via projec- tor ensemble

    Yudong Chen, Sen Wang, Jiajun Liu, Xuwei Xu, Frank de Hoog, and Zi Huang. Improved feature distillation via projec- tor ensemble. InNeurIPS, 2022. 3

  11. [11]

    Lemon: Reviving stronger and smaller lms from larger lms with linear parameter fusion

    Yilong Chen, Junyuan Shang, Zhenyu Zhang, Shiyao Cui, Tingwen Liu, Shuohuan Wang, Yu Sun, and Hua Wu. Lemon: Reviving stronger and smaller lms from larger lms with linear parameter fusion. InACL, 2024. 3

  12. [12]

    Clip benchmark, 2025

    Mehdi Cherti and Romain Beaumont. Clip benchmark, 2025. 12

  13. [13]

    Reproducible scal- ing laws for contrastive language-image learning

    Mehdi Cherti, Romain Beaumont, Ross Wightman, Mitchell Wortsman, Gabriel Ilharco, Cade Gordon, Christoph Schuh- mann, Ludwig Schmidt, and Jenia Jitsev. Reproducible scal- ing laws for contrastive language-image learning. InCVPR,

  14. [14]

    Imagenet: A large-scale hierarchical image database

    Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. InCVPR, 2009. 1, 5, 12

  15. [15]

    Distpro: Searching a fast knowledge distillation process via meta optimization

    Xueqing Deng, Dawei Sun, Shawn Newsam, and Peng Wang. Distpro: Searching a fast knowledge distillation process via meta optimization. InECCV, 2022. 3

  16. [16]

    Network expansion for practical training acceleration

    Ning Ding, Yehui Tang, Kai Han, Chao Xu, and Yunhe Wang. Network expansion for practical training acceleration. In CVPR, 2023. 3, 15

  17. [17]

    An image is worth 16x16 words: Transformers for image recognition at scale

    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, G Heigold, S Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. InICLR, 2020. 1, 5

  18. [18]

    The pascal visual object classes challenge 2012 (voc2012) development kit.PASMCL,

    Mark Everingham and John Winn. The pascal visual object classes challenge 2012 (voc2012) development kit.PASMCL,

  19. [19]

    Improving clip training with language rewrites

    Lijie Fan, Dilip Krishnan, Phillip Isola, Dina Katabi, and Yon- glong Tian. Improving clip training with language rewrites

  20. [20]

    Pyramidclip: Hierarchical feature alignment for vision-language model pretraining

    Yuting Gao, Jinfeng Liu, Zihan Xu, Jun Zhang, Ke Li, Ron- grong Ji, and Chunhua Shen. Pyramidclip: Hierarchical feature alignment for vision-language model pretraining. In NeurIPS, 2022. 3

  21. [21]

    Efficient training of bert by progressively stacking

    Linyuan Gong, Di He, Zhuohan Li, Tao Qin, Liwei Wang, and Tieyan Liu. Efficient training of bert by progressively stacking. InICML, 2019. 3

  22. [22]

    Making the v in vqa matter: Elevating the role of image understanding in visual question answering

    Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Ba- tra, and Devi Parikh. Making the v in vqa matter: Elevating the role of image understanding in visual question answering. InCVPR, 2017. 5, 13

  23. [23]

    Online knowledge distilla- tion via collaborative learning

    Qiushan Guo, Xinjiang Wang, Yichao Wu, Zhipeng Yu, Ding Liang, Xiaolin Hu, and Ping Luo. Online knowledge distilla- tion via collaborative learning. InCVPR, 2020. 3

  24. [24]

    A comprehensive overhaul of feature distillation

    Byeongho Heo, Jeesoo Kim, Sangdoo Yun, Hyojin Park, Nojun Kwak, and Jin Young Choi. A comprehensive overhaul of feature distillation. InICCV, 2019. 3

  25. [25]

    Distilling the Knowledge in a Neural Network

    Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distill- ing the knowledge in a neural network.arXiv preprint arXiv:1503.02531, 2015. 3

  26. [26]

    Lora: Low-rank adaptation of large language models

    Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models. InICLR,

  27. [27]

    Masked distillation with receptive tokens

    Tao Huang, Yuan Zhang, Shan You, Fei Wang, Chen Qian, Jian Cao, and Chang Xu. Masked distillation with receptive tokens. InICLR, 2023. 3

  28. [28]

    Accelerating pre-training of multimodal llms via chain-of-sight

    Ziyuan Huang, Kaixiang Ji, Biao Gong, Zhiwu Qing, Qing- long Zhang, Kecheng Zheng, Jian Wang, Jingdong Chen, and Ming Yang. Accelerating pre-training of multimodal llms via chain-of-sight. 2024. 1, 3

  29. [29]

    Juwels booster–a supercomputer for large-scale ai research

    Stefan Kesselheim, Andreas Herten, Kai Krajsek, Jan Ebert, Jenia Jitsev, Mehdi Cherti, Michael Langguth, Bing Gong, Scarlet Stadtler, Amirpasha Mozaffari, et al. Juwels booster–a supercomputer for large-scale ai research. InICHPC, 2021. 1, 3

  30. [30]

    SOLAR 10.7B: Scaling large language models with simple yet effective depth up-scaling.arXiv preprint arXiv:2312.15166, 2023

    Dahyun Kim, Chanjun Park, Sanghoon Kim, Wonsung Lee, Wonho Song, Yunsu Kim, Hyeonwoo Kim, Yungi Kim, Hyeonju Lee, Jihoo Kim, et al. Solar 10.7 b: Scaling large language models with simple yet effective depth up-scaling. arXiv preprint arXiv:2312.15166, 2023. 3

  31. [31]

    Cosmos: Cross-modality self-distillation for vision language pre-training

    Sanghwan Kim, Rui Xiao, Mariana-Iuliana Georgescu, Stephan Alaniz, and Zeynep Akata. Cosmos: Cross-modality self-distillation for vision language pre-training. InCVPR,

  32. [32]

    Otter: A multi-modal model with in-context instruction tuning.PAMI, 2025

    Bo Li, Yuanhan Zhang, Liangyu Chen, Jinghao Wang, Fanyi Pu, Joshua Adrian Cahyono, Jingkang Yang, Chunyuan Li, and Ziwei Liu. Otter: A multi-modal model with in-context instruction tuning.PAMI, 2025. 5

  33. [33]

    Blip- 2: Bootstrapping language-image pre-training with frozen image encoders and large language models

    Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip- 2: Bootstrapping language-image pre-training with frozen image encoders and large language models. InICML, 2023. 12

  34. [34]

    Pytorch distributed: experiences on accelerating data parallel training

    Shen Li, Yanli Zhao, Rohan Varma, Omkar Salpekar, Pieter Noordhuis, Teng Li, Adam Paszke, Jeff Smith, Brian Vaughan, Pritam Damania, et al. Pytorch distributed: experiences on accelerating data parallel training. InVLDB Endowment,

  35. [35]

    Supervi- sion exists everywhere: A data efficient contrastive language- image pre-training paradigm

    Yangguang Li, Feng Liang, Lichen Zhao, Yufeng Cui, Wanli Ouyang, Jing Shao, Fengwei Yu, and Junjie Yan. Supervi- sion exists everywhere: A data efficient contrastive language- image pre-training paradigm. InICLR, 2022. 1, 3, 5

  36. [36]

    Evaluating Object Hallucination in Large Vision-Language Models

    Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Wayne Xin Zhao, and Ji-Rong Wen. Evaluating object hallucina- tion in large vision-language models.arXiv preprint arXiv:2305.10355, 2023. 5, 13

  37. [37]

    Scaling language-image pre-training via masking

    Yanghao Li, Haoqi Fan, Ronghang Hu, Christoph Feichten- hofer, and Kaiming He. Scaling language-image pre-training via masking. InCVPR, 2023. 1, 3

  38. [38]

    Microsoft coco: Common objects in context

    Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In ECCV, 2014. 5, 12

  39. [39]

    Weight distillation: Transferring the knowledge in neural network parameters.arXiv preprint arXiv:2009.09152, 2020

    Ye Lin, Yanyang Li, Ziyang Wang, Bei Li, Quan Du, Tong Xiao, and Jingbo Zhu. Weight distillation: Transferring the knowledge in neural network parameters.arXiv preprint arXiv:2009.09152, 2020. 3

  40. [40]

    Visual instruction tuning

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. InNeurIPS, 2023. 13

  41. [41]

    Improved baselines with visual instruction tuning

    Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. InCVPR,

  42. [42]

    Norm: Knowledge distillation via n-to-one representation matching

    Xiaolong Liu, LUKING LI, Chao Li, and Anbang Yao. Norm: Knowledge distillation via n-to-one representation matching. InICLR, 2023. 3

  43. [43]

    Mllms- augmented visual-language representation learning.arXiv preprint arXiv:2311.18765, 2023

    Yanqing Liu, Kai Wang, Wenqi Shao, Ping Luo, Yu Qiao, Mike Zheng Shou, Kaipeng Zhang, and Yang You. Mllms- augmented visual-language representation learning.arXiv preprint arXiv:2311.18765, 2023. 3, 5, 12

  44. [44]

    Swin transformer: Hierarchical vision transformer using shifted windows

    Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In ICCV, 2021. 5

  45. [45]

    Decoupled Weight Decay Regularization

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization.arXiv preprint arXiv:1711.05101, 2017. 15

  46. [46]

    Learn to explain: Multimodal reasoning via thought chains for science question answering

    Pan Lu, Swaroop Mishra, Tanglin Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. Learn to explain: Multimodal reasoning via thought chains for science question answering. InNeurIPS,

  47. [47]

    Mixed precision training

    Paulius Micikevicius, Sharan Narang, Jonah Alben, Gregory Diamos, Erich Elsen, David Garcia, Boris Ginsburg, Michael Houston, Oleksii Kuchaiev, Ganesh Venkatesh, et al. Mixed precision training. InICLR, 2018. 1, 3

  48. [48]

    The role of context for object detection and semantic segmentation in the wild

    Roozbeh Mottaghi, Xianjie Chen, Xiaobai Liu, Nam-Gyu Cho, Seong-Whan Lee, Sanja Fidler, Raquel Urtasun, and Alan Yuille. The role of context for object detection and semantic segmentation in the wild. InCVPR, 2014. 5, 12

  49. [49]

    Slip: Self-supervision meets language-image pre- training

    Norman Mu, Alexander Kirillov, David Wagner, and Sain- ing Xie. Slip: Self-supervision meets language-image pre- training. InECCV, 2022. 3

  50. [50]

    Im2text: Describing images using 1 million captioned photographs

    Vicente Ordonez, Girish Kulkarni, and Tamara Berg. Im2text: Describing images using 1 million captioned photographs. In NIPS, 2011. 12

  51. [51]

    Budgeted training for vision transformer

    Xuran Pan, Xuan Jin, Yuan He, Shiji Song, Gao Huang, et al. Budgeted training for vision transformer. InICLR, 2022. 3

  52. [52]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. InICML, 2021. 1, 3, 5

  53. [53]

    Zero: Memory optimizations toward training trillion parameter models

    Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yux- iong He. Zero: Memory optimizations toward training trillion parameter models. InSC20, 2020. 3

  54. [54]

    Deepspeed: System optimizations enable training deep learning models with over 100 billion parameters

    Jeff Rasley, Samyam Rajbhandari, Olatunji Ruwase, and Yux- iong He. Deepspeed: System optimizations enable training deep learning models with over 100 billion parameters. In SIGKDD, 2020. 1, 3

  55. [55]

    Fitnets: Hints for thin deep nets

    Adriana Romero, Nicolas Ballas, Samira Ebrahimi Kahou, Antoine Chassang, Carlo Gatta, and Yoshua Bengio. Fitnets: Hints for thin deep nets. InICLR, 2015. 3

  56. [56]

    DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter

    Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. Distilbert, a distilled version of bert: smaller, faster, cheaper and lighter.arXiv preprint arXiv:1910.01108, 2019. 3

  57. [57]

    Laion- 400m: Open dataset of clip-filtered 400 million image-text pairs

    Christoph Schuhmann, Robert Kaczmarczyk, Aran Komat- suzaki, Aarush Katta, Richard Vencu, Romain Beaumont, Jenia Jitsev, Theo Coombes, and Clayton Mullis. Laion- 400m: Open dataset of clip-filtered 400 million image-text pairs. InNeurIPSW, 2021. 5, 12

  58. [58]

    Laion-5b: An open large-scale dataset for training next gener- ation image-text models

    Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. Laion-5b: An open large-scale dataset for training next gener- ation image-text models. InNeurIPS, 2022. 1, 5, 12

  59. [59]

    Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning

    Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. InACL, 2018. 1, 4, 12

  60. [60]

    Pre-trained summa- rization distillation.arXiv preprint arXiv:2010.13002, 2020

    Sam Shleifer and Alexander M Rush. Pre-trained summa- rization distillation.arXiv preprint arXiv:2010.13002, 2020. 3

  61. [61]

    Towards vqa models that can read

    Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, and Marcus Rohrbach. Towards vqa models that can read. InCVPR, 2019. 5, 13

  62. [62]

    EVA-CLIP: Improved Training Techniques for CLIP at Scale

    Quan Sun, Yuxin Fang, Ledell Wu, Xinlong Wang, and Yue Cao. Eva-clip: Improved training techniques for clip at scale. arXiv preprint arXiv:2303.15389, 2023. 3 10

  63. [63]

    EVA-CLIP- 18B: Scaling clip to 18 billion parameters.arXiv:2402.04252, 2024

    Quan Sun, Jinsheng Wang, Qiying Yu, Yufeng Cui, Fan Zhang, Xiaosong Zhang, and Xinlong Wang. Eva-clip- 18b: Scaling clip to 18 billion parameters.arXiv preprint arXiv:2402.04252, 2024. 3

  64. [64]

    LLaMA: Open and Efficient Foundation Language Models

    Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Mar- tinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Roz- ière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models.arXiv preprint arXiv:2302.13971, 2023. 5

  65. [65]

    Mimetic initialization of self-attention layers

    Asher Trockman and J Zico Kolter. Mimetic initialization of self-attention layers. InICML, 2023. 3

  66. [66]

    Efficienttrain: Exploring gener- alized curriculum learning for training visual backbones

    Yulin Wang, Yang Yue, Rui Lu, Tianjiao Liu, Zhao Zhong, Shiji Song, and Gao Huang. Efficienttrain: Exploring gener- alized curriculum learning for training visual backbones. In ICCV, 2023. 3

  67. [67]

    arXiv preprint arXiv:2407.01445 , year=

    Xiyuan Wei, Fanjiang Ye, Ori Yonay, Xingyu Chen, Baixi Sun, Dingwen Tao, and Tianbao Yang. Fastclip: A suite of optimization techniques to accelerate clip training with limited resources.arXiv preprint arXiv:2407.01445, 2024. 1

  68. [68]

    Lotlip: Improving language-image pre-training for long text understanding

    Wei Wu, Kecheng Zheng, Shuailei Ma, Fan Lu, Yuxin Guo, Yifei Zhang, Wei Chen, Qingpei Guo, Yujun Shen, and Zheng- Jun Zha. Lotlip: Improving language-image pre-training for long text understanding. InNeurIPS, 2024. 3, 5

  69. [69]

    Initializ- ing variable-sized vision transformers from learngene with learnable transformation

    Shiyu Xia, Yuankun Zu, Xu Yang, and Xin Geng. Initializ- ing variable-sized vision transformers from learngene with learnable transformation. InNeurIPS, 2024. 3

  70. [70]

    San: side adapter network for open-vocabulary semantic segmentation.TPAMI, 2023

    Mengde Xu, Zheng Zhang, Fangyun Wei, Han Hu, and Xiang Bai. San: side adapter network for open-vocabulary semantic segmentation.TPAMI, 2023. 5, 7, 13

  71. [71]

    Initializing models with larger ones

    Zhiqiu Xu, Yanjie Chen, Kirill Vishniakov, Yida Yin, Zhiqiang Shen, Trevor Darrell, Lingjie Liu, and Zhuang Liu. Initializing models with larger ones. InICLR, 2024. 3

  72. [72]

    Alip: Adaptive language-image pre-training with synthetic caption

    Kaicheng Yang, Jiankang Deng, Xiang An, Jiawei Li, Ziyong Feng, Jia Guo, Jing Yang, and Tongliang Liu. Alip: Adaptive language-image pre-training with synthetic caption. InICCV,

  73. [73]

    Masked generative distillation

    Zhendong Yang, Zhe Li, Mingqi Shao, Dachuan Shi, Zehuan Yuan, and Chun Yuan. Masked generative distillation. In ECCV, 2022. 3

  74. [74]

    Filip: Fine-grained interactive language-image pre-training

    Lewei Yao, Runhui Huang, Lu Hou, Guansong Lu, Minzhe Niu, Hang Xu, Xiaodan Liang, Zhenguo Li, Xin Jiang, and Chunjing Xu. Filip: Fine-grained interactive language-image pre-training. InICLR, 2022. 3

  75. [75]

    A gift from knowledge distillation: Fast optimization, network minimization and transfer learning

    Junho Yim, Donggyu Joo, Jihoon Bae, and Junmo Kim. A gift from knowledge distillation: Fast optimization, network minimization and transfer learning. InCVPR, 2017. 3

  76. [76]

    Udon: Universal dynamic online distil- lation for generic image representations

    Nikolaos-Antonios Ypsilantis, Kaifeng Chen, André Araujo, and Ondrej Chum. Udon: Universal dynamic online distil- lation for generic image representations. InNeurIPS, 2024. 3

  77. [77]

    Scaling vision transformers

    Xiaohua Zhai, Alexander Kolesnikov, Neil Houlsby, and Lu- cas Beyer. Scaling vision transformers. InCVPR, 2022. 8

  78. [78]

    Sigmoid loss for language image pre-training

    Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. InCVPR, 2023. 1

  79. [79]

    Deep mutual learning

    Ying Zhang, Tao Xiang, Timothy M Hospedales, and Huchuan Lu. Deep mutual learning. InCVPR, 2018. 3

  80. [80]

    Dreamlip: Language- image pre-training with long captions

    Kecheng Zheng, Yifei Zhang, Wei Wu, Fan Lu, Shuailei Ma, Xin Jin, Wei Chen, and Yujun Shen. Dreamlip: Language- image pre-training with long captions. InECCV, 2024. 1, 3, 4, 5, 12

Showing first 80 references.