arxiv: 2604.12391 · v1 · submitted 2026-04-14 · 💻 cs.CV · cs.AI

Recognition: 2 theorem links

· Lean Theorem

Chain-of-Models Pre-Training: Rethinking Training Acceleration of Vision Foundation Models

Anbang Yao, Chao Li, Jiawei Fan, Shigeng Wang, Xiaolong Liu

Authors on Pith no claims yet

Pith reviewed 2026-05-10 16:12 UTC · model grok-4.3

classification 💻 cs.CV cs.AI

keywords chain-of-models pre-trainingvision foundation modelstraining accelerationknowledge transfermodel family scalingpre-training efficiency

0 comments

The pith

Pre-training vision foundation model families from smallest to largest reuses knowledge to match or exceed individual training performance at far lower total cost.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes Chain-of-Models Pre-Training to accelerate training for entire families of vision foundation models instead of handling each model in isolation. Only the smallest model receives full individual pre-training; each subsequent larger model is trained by sequentially transferring knowledge backward from its smaller predecessors, reusing information jointly in the parameter space and the feature space. This produces performance that is mostly superior to standard separate training while cutting overall compute, with the efficiency advantage growing as the family includes more models. The claim is tested across 45 datasets on both zero-shot and fine-tuning tasks and shown to be independent of the underlying pre-training method.

Core claim

CoM-PT sets up an ascending-size model chain in which only the smallest model undergoes standard pre-training while the others are trained via sequential inverse knowledge transfer that reuses knowledge from smaller predecessors in both parameter and feature spaces, yielding mostly superior performance at substantially lower training cost and higher efficiency as the number of models in the family increases.

What carries the argument

The ascending model chain that performs sequential inverse knowledge transfer by jointly reusing knowledge from smaller predecessors in parameter space and feature space.

If this is right

All models in the chain mostly outperform models trained individually on zero-shot and fine-tuning tasks across 45 datasets.
Computational cost drops sharply; for example, prepending smaller models up to ViT-L reduces complexity by as much as 72 percent.
Acceleration ratios rise with family size, moving from 4.13X for three models to 7.09X for seven models.
The method applies regardless of the specific pre-training paradigm.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same ascending-chain pattern could be tested on language-model families to measure whether efficiency gains scale similarly.
Model development practices might shift toward designing compatible size families rather than isolated large models to capture cumulative savings.
The transfer mechanism might combine with other acceleration methods for still larger gains in compute-intensive regimes.

Load-bearing premise

That knowledge transferred from smaller to larger models can be reused in a way that keeps or improves performance without adding overhead that cancels the overall savings.

What would settle it

A controlled run on a new model family in which the total compute for the CoM-PT chain exceeds the sum of individual trainings or in which downstream accuracy falls below the individually trained baselines.

Figures

Figures reproduced from arXiv: 2604.12391 by Anbang Yao, Chao Li, Jiawei Fan, Shigeng Wang, Xiaolong Liu.

**Figure 1.** Figure 1: Standard individual pre-training vs. Chain-of-Models Pre-Training. All ViTs [17] are trained on a combined dataset of CC3M [59] and CC12M [4], and evaluated on ImageNet-1K [14]. As a representative example, the emergence of contrastive language-image pre-training (CLIP) [52] enables vision transformers (ViTs) to achieve state-of-the-art performance in zero-shot tasks and solidifies them as the de facto vi… view at source ↗

**Figure 2.** Figure 2: Overview of Chain-of-Models Pre-training pipeline with inverse knowledge transfer relay. Models are organized sequentially by ascending model size, from the smallest model m1 to the largest mn, forming a model chain. m1 first undergoes standard individual pre-training. Each subsequent mi is then efficiently trained via inverse knowledge transfer (denoted as ) from its immediate predecessor mi−1, forming a … view at source ↗

**Figure 3.** Figure 3: Inverse knowledge transfer from mi to mi+1. block-block width differences, we directly insert the parameters of the small teacher into the large student, leaving the remaining parameters randomly initialized; ii) for layer-layer depth differences, we duplicate weights of each layer and index it as the succeeding layer. See Table H of Supplementary Materials for comparisons against sophisticated designs. … view at source ↗

**Figure 4.** Figure 4: Convergence analysis across different training epochs. We train ViT-T/16, ViT-S/16, and ViT-B/16 on CC3M for 32, 64, 128, and 256 epochs to determine convergence points. Top-1 (%) represents zero-shot classification accuracy on ImageNet-1K. Evaluation Protocols. We evaluate model performance across 45 datasets spanning zero-shot tasks and fine-tuning tasks. i) Zero-shot performance is assessed on ImageNet-… view at source ↗

**Figure 5.** Figure 5 [PITH_FULL_IMAGE:figures/full_fig_p006_5.png] view at source ↗

**Figure 6.** Figure 6: Reduced training MACs vs. model chain starting with smaller models. The reduced MACs represent the total training cost of the model chain relative to the cost of individually pretraining ViT-L. All models are performance-lossless. Vision-Language Tasks. Vision encoders pre-trained with CoM-PT achieve highly competitive performance against baselines across four datasets. It outperforms baseline in text und… view at source ↗

**Figure 9.** Figure 9: Acceleration ratio variation under different training epochs and data scales. Experiments use the ViT family. (a) On CC3M, we vary the training epochs (32, 64, and 128). (b) We test the data scale from 22.0M to 206.8M under a fixed 64-epoch setting to isolate data scale effects. The curve fitting format follows [77]. 5.4.2. Pre-training Acceleration under Different Setups Acceleration under Various Trainin… view at source ↗

read the original abstract

In this paper, we present Chain-of-Models Pre-Training (CoM-PT), a novel performance-lossless training acceleration method for vision foundation models (VFMs). This approach fundamentally differs from existing acceleration methods in its core motivation: rather than optimizing each model individually, CoM-PT is designed to accelerate the training pipeline at the model family level, scaling efficiently as the model family expands. Specifically, CoM-PT establishes a pre-training sequence for the model family, arranged in ascending order of model size, called model chain. In this chain, only the smallest model undergoes standard individual pre-training, while the other models are efficiently trained through sequential inverse knowledge transfer from their smaller predecessors by jointly reusing the knowledge in the parameter space and the feature space. As a result, CoM-PT enables all models to achieve performance that is mostly superior to standard individual training while significantly reducing training cost, and this is extensively validated across 45 datasets spanning zero-shot and fine-tuning tasks. Notably, its efficient scaling property yields a remarkable phenomenon: training more models even results in higher efficiency. For instance, when pre-training on CC3M: i) given ViT-L as the largest model, progressively prepending smaller models to the model chain reduces computational complexity by up to 72%; ii) within a fixed model size range, as the VFM family scales across 3, 4, and 7 models, the acceleration ratio of CoM-PT exhibits a striking leap: from 4.13X to 5.68X and 7.09X. Since CoM-PT is naturally agnostic to specific pre-training paradigms, we open-source the code to spur further extensions in more computationally intensive scenarios, such as large language model pre-training.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

CoM-PT chains vision models by size and transfers knowledge inversely from small to large, delivering real compute savings on the tested setups while keeping performance mostly on par or better.

read the letter

The paper's main contribution is a model chain for vision foundation model families: train the smallest one normally, then use sequential inverse transfer in both parameter and feature space to handle the rest. This targets family-level acceleration instead of optimizing models one by one, and the scaling behavior is the part that stands out—adding more models to the chain improves the overall efficiency ratio rather than hurting it.

Referee Report

2 major / 3 minor

Summary. The paper introduces Chain-of-Models Pre-Training (CoM-PT), a training acceleration method for vision foundation models that arranges models into an ascending-size chain. Only the smallest model receives standard individual pre-training; all larger models are trained via sequential inverse knowledge transfer that jointly reuses knowledge from predecessors in both parameter space and feature space. The central claims are that this yields performance that is mostly superior (or at least non-inferior) to independent training, delivers substantial compute savings (e.g., up to 72% reduction for ViT-L on CC3M), and exhibits an efficient scaling property in which adding more models to the family increases the acceleration ratio (4.13X to 7.09X). These results are reported across 45 datasets covering zero-shot and fine-tuning tasks, and the method is presented as agnostic to specific pre-training paradigms.

Significance. If the empirical results hold, the work is significant because it shifts the optimization target from individual models to entire model families and demonstrates that sequential inverse transfer can preserve or improve performance while reducing total compute. The counter-intuitive scaling observation—that larger families become more efficient—is noteworthy and, if reproducible, could influence how vision foundation model suites are trained. Credit is due for the extensive validation on 45 datasets and for open-sourcing the code, both of which support reproducibility and extension to other domains such as language-model pre-training.

major comments (2)

[§4.3] §4.3 and Table 4: the claim that performance is 'mostly superior' across all 45 datasets requires a clearer breakdown (number of datasets showing statistically significant gains, parity, or degradation) and per-model-size results; without this granularity the 'mostly' qualifier remains difficult to evaluate against the central performance claim.
[§3.2] §3.2, Eq. (3)–(5): the joint parameter- and feature-space transfer mechanism is load-bearing for both the performance and efficiency claims; the manuscript should explicitly report the additional FLOPs or memory cost of the feature-space reuse step itself so that readers can verify that the reported net savings (e.g., 72%) are not offset by transfer overhead.

minor comments (3)

The abstract states 'performance-lossless' while the body uses 'mostly superior'; harmonize the terminology and define the precise acceptance criterion for 'superior'.
[Figure 3] Figure 3 (scaling curves) and the associated text would benefit from error bars or multiple random seeds to substantiate the efficiency-leap claim when the family grows from 3 to 7 models.
The open-sourced code link is appreciated; please ensure the released repository includes the exact hyper-parameters and data splits used for the 45-dataset evaluation.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive review and the recommendation for minor revision. The comments identify opportunities to strengthen the clarity of our performance claims and the transparency of our efficiency analysis. We address each point below and will revise the manuscript accordingly.

read point-by-point responses

Referee: [§4.3] §4.3 and Table 4: the claim that performance is 'mostly superior' across all 45 datasets requires a clearer breakdown (number of datasets showing statistically significant gains, parity, or degradation) and per-model-size results; without this granularity the 'mostly' qualifier remains difficult to evaluate against the central performance claim.

Authors: We agree that a granular breakdown is necessary to substantiate the 'mostly superior' claim. In the revised manuscript we will augment §4.3 with a new table (or expanded Table 4) that reports, for each model size in the chain: (i) the number of datasets showing statistically significant gains (using paired t-tests at p<0.05), (ii) the number showing parity within a small tolerance, and (iii) any cases of degradation. We will also include per-model-size average metrics across the 45 datasets to allow direct comparison of improvement magnitude at each scale. revision: yes
Referee: [§3.2] §3.2, Eq. (3)–(5): the joint parameter- and feature-space transfer mechanism is load-bearing for both the performance and efficiency claims; the manuscript should explicitly report the additional FLOPs or memory cost of the feature-space reuse step itself so that readers can verify that the reported net savings (e.g., 72%) are not offset by transfer overhead.

Authors: We appreciate this request for explicit accounting. The feature-space reuse component in Eq. (3)–(5) adds a modest overhead from the additional feature extraction and alignment computations during each transfer step. In the revision we will insert a dedicated paragraph in §3.2 that quantifies the extra FLOPs and peak memory for this step across the model sizes used in our experiments. We will then recompute and report the net training-cost reduction (including this overhead) for the CC3M ViT-L case and the scaling experiments, confirming that the headline savings figures remain valid after subtraction. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper introduces CoM-PT as a new pre-training method that chains models by size and reuses knowledge from smaller to larger ones via parameter and feature space transfer. The central claims rest on empirical results across 45 datasets and scaling experiments rather than any self-referential definition, fitted input renamed as prediction, or load-bearing self-citation. No equations are presented that reduce the claimed acceleration or performance gains to the method's own inputs by construction, and the approach is described as agnostic to specific paradigms with open-sourced code for independent verification.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract does not specify any free parameters, axioms, or invented entities. The method builds on existing knowledge transfer concepts but introduces a new chaining strategy.

pith-pipeline@v0.9.0 · 5635 in / 1332 out tokens · 54668 ms · 2026-05-10T16:12:46.021325+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear
CoM-PT establishes a pre-training sequence for the model family, arranged in ascending order of model size, called model chain... sequential inverse knowledge transfer from their smaller predecessors by jointly reusing the knowledge in the parameter space and the feature space
IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear
Inverse Weight Initialization... Depth Duplication And Width Insertion... Inverse Feature Distillation L_IFD(F^t, F^s)=α||F^t−T(F^s)||²₂

Reference graph

Works this paper leans on

85 extracted references · 17 canonical work pages · 10 internal anchors

[1]

GPT-4 Technical Report

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report.arXiv preprint arXiv:2303.08774, 2023. 13

work page internal anchor Pith review Pith/arXiv arXiv 2023
[2]

Qwen Technical Report

Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xi- aodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. Qwen technical report.arXiv preprint arXiv:2309.16609,

work page internal anchor Pith review Pith/arXiv arXiv
[3]

Coco- stuff: Thing and stuff classes in context

Holger Caesar, Jasper Uijlings, and Vittorio Ferrari. Coco- stuff: Thing and stuff classes in context. InCVPR, 2018. 7, 12

2018
[4]

Conceptual 12M: Pushing web-scale image-text pre- training to recognize long-tail visual concepts

Soravit Changpinyo, Piyush Sharma, Nan Ding, and Radu Soricut. Conceptual 12M: Pushing web-scale image-text pre- training to recognize long-tail visual concepts. InCVPR,
[5]

bert2bert: Towards reusable pretrained language models

Cheng Chen, Yichun Yin, Lifeng Shang, Xin Jiang, Yujia Qin, Fengyu Wang, Zhi Wang, Xiao Chen, Zhiyuan Liu, and Qun Liu. bert2bert: Towards reusable pretrained language models. InACL, 2022. 3

2022
[6]

Cross-layer distillation with semantic calibration

Defang Chen, Jian-Ping Mei, Yuan Zhang, Can Wang, Zhe Wang, Yan Feng, and Chun Chen. Cross-layer distillation with semantic calibration. InAAAI, 2021. 3

2021
[7]

ShareGPT4V: Improving Large Multi-Modal Models with Better Captions

Lin Chen, Jisong Li, Xiaoyi Dong, Pan Zhang, Conghui He, Jiaqi Wang, Feng Zhao, and Dahua Lin. Sharegpt4v: Im- proving large multi-modal models with better captions.arXiv preprint arXiv:2311.12793, 2023. 5

work page internal anchor Pith review arXiv 2023
[8]

Distilling knowledge via knowledge review

Pengguang Chen, Shu Liu, Hengshuang Zhao, and Jiaya Jia. Distilling knowledge via knowledge review. InCVPR, 2021. 3

2021
[9]

Net2net: Accelerating learning via knowl- edge transfer

Tianqi Chen, Ian Goodfellow, and Jonathon Shlens. Net2net: Accelerating learning via knowledge transfer.arXiv preprint arXiv:1511.05641, 2015. 3, 15

work page arXiv 2015
[10]

Improved feature distillation via projec- tor ensemble

Yudong Chen, Sen Wang, Jiajun Liu, Xuwei Xu, Frank de Hoog, and Zi Huang. Improved feature distillation via projec- tor ensemble. InNeurIPS, 2022. 3

2022
[11]

Lemon: Reviving stronger and smaller lms from larger lms with linear parameter fusion

Yilong Chen, Junyuan Shang, Zhenyu Zhang, Shiyao Cui, Tingwen Liu, Shuohuan Wang, Yu Sun, and Hua Wu. Lemon: Reviving stronger and smaller lms from larger lms with linear parameter fusion. InACL, 2024. 3

2024
[12]

Clip benchmark, 2025

Mehdi Cherti and Romain Beaumont. Clip benchmark, 2025. 12

2025
[13]

Reproducible scal- ing laws for contrastive language-image learning

Mehdi Cherti, Romain Beaumont, Ross Wightman, Mitchell Wortsman, Gabriel Ilharco, Cade Gordon, Christoph Schuh- mann, Ludwig Schmidt, and Jenia Jitsev. Reproducible scal- ing laws for contrastive language-image learning. InCVPR,
[14]

Imagenet: A large-scale hierarchical image database

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. InCVPR, 2009. 1, 5, 12

2009
[15]

Distpro: Searching a fast knowledge distillation process via meta optimization

Xueqing Deng, Dawei Sun, Shawn Newsam, and Peng Wang. Distpro: Searching a fast knowledge distillation process via meta optimization. InECCV, 2022. 3

2022
[16]

Network expansion for practical training acceleration

Ning Ding, Yehui Tang, Kai Han, Chao Xu, and Yunhe Wang. Network expansion for practical training acceleration. In CVPR, 2023. 3, 15

2023
[17]

An image is worth 16x16 words: Transformers for image recognition at scale

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, G Heigold, S Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. InICLR, 2020. 1, 5

2020
[18]

The pascal visual object classes challenge 2012 (voc2012) development kit.PASMCL,

Mark Everingham and John Winn. The pascal visual object classes challenge 2012 (voc2012) development kit.PASMCL,

2012
[19]

Improving clip training with language rewrites

Lijie Fan, Dilip Krishnan, Phillip Isola, Dina Katabi, and Yon- glong Tian. Improving clip training with language rewrites
[20]

Pyramidclip: Hierarchical feature alignment for vision-language model pretraining

Yuting Gao, Jinfeng Liu, Zihan Xu, Jun Zhang, Ke Li, Ron- grong Ji, and Chunhua Shen. Pyramidclip: Hierarchical feature alignment for vision-language model pretraining. In NeurIPS, 2022. 3

2022
[21]

Efficient training of bert by progressively stacking

Linyuan Gong, Di He, Zhuohan Li, Tao Qin, Liwei Wang, and Tieyan Liu. Efficient training of bert by progressively stacking. InICML, 2019. 3

2019
[22]

Making the v in vqa matter: Elevating the role of image understanding in visual question answering

Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Ba- tra, and Devi Parikh. Making the v in vqa matter: Elevating the role of image understanding in visual question answering. InCVPR, 2017. 5, 13

2017
[23]

Online knowledge distilla- tion via collaborative learning

Qiushan Guo, Xinjiang Wang, Yichao Wu, Zhipeng Yu, Ding Liang, Xiaolin Hu, and Ping Luo. Online knowledge distilla- tion via collaborative learning. InCVPR, 2020. 3

2020
[24]

A comprehensive overhaul of feature distillation

Byeongho Heo, Jeesoo Kim, Sangdoo Yun, Hyojin Park, Nojun Kwak, and Jin Young Choi. A comprehensive overhaul of feature distillation. InICCV, 2019. 3

2019
[25]

Distilling the Knowledge in a Neural Network

Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distill- ing the knowledge in a neural network.arXiv preprint arXiv:1503.02531, 2015. 3

work page internal anchor Pith review Pith/arXiv arXiv 2015
[26]

Lora: Low-rank adaptation of large language models

Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models. InICLR,
[27]

Masked distillation with receptive tokens

Tao Huang, Yuan Zhang, Shan You, Fei Wang, Chen Qian, Jian Cao, and Chang Xu. Masked distillation with receptive tokens. InICLR, 2023. 3

2023
[28]

Accelerating pre-training of multimodal llms via chain-of-sight

Ziyuan Huang, Kaixiang Ji, Biao Gong, Zhiwu Qing, Qing- long Zhang, Kecheng Zheng, Jian Wang, Jingdong Chen, and Ming Yang. Accelerating pre-training of multimodal llms via chain-of-sight. 2024. 1, 3

2024
[29]

Juwels booster–a supercomputer for large-scale ai research

Stefan Kesselheim, Andreas Herten, Kai Krajsek, Jan Ebert, Jenia Jitsev, Mehdi Cherti, Michael Langguth, Bing Gong, Scarlet Stadtler, Amirpasha Mozaffari, et al. Juwels booster–a supercomputer for large-scale ai research. InICHPC, 2021. 1, 3

2021
[30]

SOLAR 10.7B: Scaling large language models with simple yet effective depth up-scaling.arXiv preprint arXiv:2312.15166, 2023

Dahyun Kim, Chanjun Park, Sanghoon Kim, Wonsung Lee, Wonho Song, Yunsu Kim, Hyeonwoo Kim, Yungi Kim, Hyeonju Lee, Jihoo Kim, et al. Solar 10.7 b: Scaling large language models with simple yet effective depth up-scaling. arXiv preprint arXiv:2312.15166, 2023. 3

work page arXiv 2023
[31]

Cosmos: Cross-modality self-distillation for vision language pre-training

Sanghwan Kim, Rui Xiao, Mariana-Iuliana Georgescu, Stephan Alaniz, and Zeynep Akata. Cosmos: Cross-modality self-distillation for vision language pre-training. InCVPR,
[32]

Otter: A multi-modal model with in-context instruction tuning.PAMI, 2025

Bo Li, Yuanhan Zhang, Liangyu Chen, Jinghao Wang, Fanyi Pu, Joshua Adrian Cahyono, Jingkang Yang, Chunyuan Li, and Ziwei Liu. Otter: A multi-modal model with in-context instruction tuning.PAMI, 2025. 5

2025
[33]

Blip- 2: Bootstrapping language-image pre-training with frozen image encoders and large language models

Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip- 2: Bootstrapping language-image pre-training with frozen image encoders and large language models. InICML, 2023. 12

2023
[34]

Pytorch distributed: experiences on accelerating data parallel training

Shen Li, Yanli Zhao, Rohan Varma, Omkar Salpekar, Pieter Noordhuis, Teng Li, Adam Paszke, Jeff Smith, Brian Vaughan, Pritam Damania, et al. Pytorch distributed: experiences on accelerating data parallel training. InVLDB Endowment,
[35]

Supervi- sion exists everywhere: A data efficient contrastive language- image pre-training paradigm

Yangguang Li, Feng Liang, Lichen Zhao, Yufeng Cui, Wanli Ouyang, Jing Shao, Fengwei Yu, and Junjie Yan. Supervi- sion exists everywhere: A data efficient contrastive language- image pre-training paradigm. InICLR, 2022. 1, 3, 5

2022
[36]

Evaluating Object Hallucination in Large Vision-Language Models

Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Wayne Xin Zhao, and Ji-Rong Wen. Evaluating object hallucina- tion in large vision-language models.arXiv preprint arXiv:2305.10355, 2023. 5, 13

work page internal anchor Pith review arXiv 2023
[37]

Scaling language-image pre-training via masking

Yanghao Li, Haoqi Fan, Ronghang Hu, Christoph Feichten- hofer, and Kaiming He. Scaling language-image pre-training via masking. InCVPR, 2023. 1, 3

2023
[38]

Microsoft coco: Common objects in context

Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In ECCV, 2014. 5, 12

2014
[39]

Weight distillation: Transferring the knowledge in neural network parameters.arXiv preprint arXiv:2009.09152, 2020

Ye Lin, Yanyang Li, Ziyang Wang, Bei Li, Quan Du, Tong Xiao, and Jingbo Zhu. Weight distillation: Transferring the knowledge in neural network parameters.arXiv preprint arXiv:2009.09152, 2020. 3

work page arXiv 2009
[40]

Visual instruction tuning

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. InNeurIPS, 2023. 13

2023
[41]

Improved baselines with visual instruction tuning

Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. InCVPR,
[42]

Norm: Knowledge distillation via n-to-one representation matching

Xiaolong Liu, LUKING LI, Chao Li, and Anbang Yao. Norm: Knowledge distillation via n-to-one representation matching. InICLR, 2023. 3

2023
[43]

Mllms- augmented visual-language representation learning.arXiv preprint arXiv:2311.18765, 2023

Yanqing Liu, Kai Wang, Wenqi Shao, Ping Luo, Yu Qiao, Mike Zheng Shou, Kaipeng Zhang, and Yang You. Mllms- augmented visual-language representation learning.arXiv preprint arXiv:2311.18765, 2023. 3, 5, 12

work page arXiv 2023
[44]

Swin transformer: Hierarchical vision transformer using shifted windows

Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In ICCV, 2021. 5

2021
[45]

Decoupled Weight Decay Regularization

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization.arXiv preprint arXiv:1711.05101, 2017. 15

work page internal anchor Pith review Pith/arXiv arXiv 2017
[46]

Learn to explain: Multimodal reasoning via thought chains for science question answering

Pan Lu, Swaroop Mishra, Tanglin Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. Learn to explain: Multimodal reasoning via thought chains for science question answering. InNeurIPS,
[47]

Mixed precision training

Paulius Micikevicius, Sharan Narang, Jonah Alben, Gregory Diamos, Erich Elsen, David Garcia, Boris Ginsburg, Michael Houston, Oleksii Kuchaiev, Ganesh Venkatesh, et al. Mixed precision training. InICLR, 2018. 1, 3

2018
[48]

The role of context for object detection and semantic segmentation in the wild

Roozbeh Mottaghi, Xianjie Chen, Xiaobai Liu, Nam-Gyu Cho, Seong-Whan Lee, Sanja Fidler, Raquel Urtasun, and Alan Yuille. The role of context for object detection and semantic segmentation in the wild. InCVPR, 2014. 5, 12

2014
[49]

Slip: Self-supervision meets language-image pre- training

Norman Mu, Alexander Kirillov, David Wagner, and Sain- ing Xie. Slip: Self-supervision meets language-image pre- training. InECCV, 2022. 3

2022
[50]

Im2text: Describing images using 1 million captioned photographs

Vicente Ordonez, Girish Kulkarni, and Tamara Berg. Im2text: Describing images using 1 million captioned photographs. In NIPS, 2011. 12

2011
[51]

Budgeted training for vision transformer

Xuran Pan, Xuan Jin, Yuan He, Shiji Song, Gao Huang, et al. Budgeted training for vision transformer. InICLR, 2022. 3

2022
[52]

Learning transferable visual models from natural language supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. InICML, 2021. 1, 3, 5

2021
[53]

Zero: Memory optimizations toward training trillion parameter models

Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yux- iong He. Zero: Memory optimizations toward training trillion parameter models. InSC20, 2020. 3

2020
[54]

Deepspeed: System optimizations enable training deep learning models with over 100 billion parameters

Jeff Rasley, Samyam Rajbhandari, Olatunji Ruwase, and Yux- iong He. Deepspeed: System optimizations enable training deep learning models with over 100 billion parameters. In SIGKDD, 2020. 1, 3

2020
[55]

Fitnets: Hints for thin deep nets

Adriana Romero, Nicolas Ballas, Samira Ebrahimi Kahou, Antoine Chassang, Carlo Gatta, and Yoshua Bengio. Fitnets: Hints for thin deep nets. InICLR, 2015. 3

2015
[56]

DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter

Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. Distilbert, a distilled version of bert: smaller, faster, cheaper and lighter.arXiv preprint arXiv:1910.01108, 2019. 3

work page internal anchor Pith review arXiv 1910
[57]

Laion- 400m: Open dataset of clip-filtered 400 million image-text pairs

Christoph Schuhmann, Robert Kaczmarczyk, Aran Komat- suzaki, Aarush Katta, Richard Vencu, Romain Beaumont, Jenia Jitsev, Theo Coombes, and Clayton Mullis. Laion- 400m: Open dataset of clip-filtered 400 million image-text pairs. InNeurIPSW, 2021. 5, 12

2021
[58]

Laion-5b: An open large-scale dataset for training next gener- ation image-text models

Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. Laion-5b: An open large-scale dataset for training next gener- ation image-text models. InNeurIPS, 2022. 1, 5, 12

2022
[59]

Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning

Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. InACL, 2018. 1, 4, 12

2018
[60]

Pre-trained summa- rization distillation.arXiv preprint arXiv:2010.13002, 2020

Sam Shleifer and Alexander M Rush. Pre-trained summa- rization distillation.arXiv preprint arXiv:2010.13002, 2020. 3

work page arXiv 2010
[61]

Towards vqa models that can read

Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, and Marcus Rohrbach. Towards vqa models that can read. InCVPR, 2019. 5, 13

2019
[62]

EVA-CLIP: Improved Training Techniques for CLIP at Scale

Quan Sun, Yuxin Fang, Ledell Wu, Xinlong Wang, and Yue Cao. Eva-clip: Improved training techniques for clip at scale. arXiv preprint arXiv:2303.15389, 2023. 3 10

work page internal anchor Pith review arXiv 2023
[63]

EVA-CLIP- 18B: Scaling clip to 18 billion parameters.arXiv:2402.04252, 2024

Quan Sun, Jinsheng Wang, Qiying Yu, Yufeng Cui, Fan Zhang, Xiaosong Zhang, and Xinlong Wang. Eva-clip- 18b: Scaling clip to 18 billion parameters.arXiv preprint arXiv:2402.04252, 2024. 3

work page arXiv 2024
[64]

LLaMA: Open and Efficient Foundation Language Models

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Mar- tinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Roz- ière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models.arXiv preprint arXiv:2302.13971, 2023. 5

work page internal anchor Pith review Pith/arXiv arXiv 2023
[65]

Mimetic initialization of self-attention layers

Asher Trockman and J Zico Kolter. Mimetic initialization of self-attention layers. InICML, 2023. 3

2023
[66]

Efficienttrain: Exploring gener- alized curriculum learning for training visual backbones

Yulin Wang, Yang Yue, Rui Lu, Tianjiao Liu, Zhao Zhong, Shiji Song, and Gao Huang. Efficienttrain: Exploring gener- alized curriculum learning for training visual backbones. In ICCV, 2023. 3

2023
[67]

arXiv preprint arXiv:2407.01445 , year=

Xiyuan Wei, Fanjiang Ye, Ori Yonay, Xingyu Chen, Baixi Sun, Dingwen Tao, and Tianbao Yang. Fastclip: A suite of optimization techniques to accelerate clip training with limited resources.arXiv preprint arXiv:2407.01445, 2024. 1

work page arXiv 2024
[68]

Lotlip: Improving language-image pre-training for long text understanding

Wei Wu, Kecheng Zheng, Shuailei Ma, Fan Lu, Yuxin Guo, Yifei Zhang, Wei Chen, Qingpei Guo, Yujun Shen, and Zheng- Jun Zha. Lotlip: Improving language-image pre-training for long text understanding. InNeurIPS, 2024. 3, 5

2024
[69]

Initializ- ing variable-sized vision transformers from learngene with learnable transformation

Shiyu Xia, Yuankun Zu, Xu Yang, and Xin Geng. Initializ- ing variable-sized vision transformers from learngene with learnable transformation. InNeurIPS, 2024. 3

2024
[70]

San: side adapter network for open-vocabulary semantic segmentation.TPAMI, 2023

Mengde Xu, Zheng Zhang, Fangyun Wei, Han Hu, and Xiang Bai. San: side adapter network for open-vocabulary semantic segmentation.TPAMI, 2023. 5, 7, 13

2023
[71]

Initializing models with larger ones

Zhiqiu Xu, Yanjie Chen, Kirill Vishniakov, Yida Yin, Zhiqiang Shen, Trevor Darrell, Lingjie Liu, and Zhuang Liu. Initializing models with larger ones. InICLR, 2024. 3

2024
[72]

Alip: Adaptive language-image pre-training with synthetic caption

Kaicheng Yang, Jiankang Deng, Xiang An, Jiawei Li, Ziyong Feng, Jia Guo, Jing Yang, and Tongliang Liu. Alip: Adaptive language-image pre-training with synthetic caption. InICCV,
[73]

Masked generative distillation

Zhendong Yang, Zhe Li, Mingqi Shao, Dachuan Shi, Zehuan Yuan, and Chun Yuan. Masked generative distillation. In ECCV, 2022. 3

2022
[74]

Filip: Fine-grained interactive language-image pre-training

Lewei Yao, Runhui Huang, Lu Hou, Guansong Lu, Minzhe Niu, Hang Xu, Xiaodan Liang, Zhenguo Li, Xin Jiang, and Chunjing Xu. Filip: Fine-grained interactive language-image pre-training. InICLR, 2022. 3

2022
[75]

A gift from knowledge distillation: Fast optimization, network minimization and transfer learning

Junho Yim, Donggyu Joo, Jihoon Bae, and Junmo Kim. A gift from knowledge distillation: Fast optimization, network minimization and transfer learning. InCVPR, 2017. 3

2017
[76]

Udon: Universal dynamic online distil- lation for generic image representations

Nikolaos-Antonios Ypsilantis, Kaifeng Chen, André Araujo, and Ondrej Chum. Udon: Universal dynamic online distil- lation for generic image representations. InNeurIPS, 2024. 3

2024
[77]

Scaling vision transformers

Xiaohua Zhai, Alexander Kolesnikov, Neil Houlsby, and Lu- cas Beyer. Scaling vision transformers. InCVPR, 2022. 8

2022
[78]

Sigmoid loss for language image pre-training

Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. InCVPR, 2023. 1

2023
[79]

Deep mutual learning

Ying Zhang, Tao Xiang, Timothy M Hospedales, and Huchuan Lu. Deep mutual learning. InCVPR, 2018. 3

2018
[80]

Dreamlip: Language- image pre-training with long captions

Kecheng Zheng, Yifei Zhang, Wei Wu, Fan Lu, Shuailei Ma, Xin Jin, Wei Chen, and Yujun Shen. Dreamlip: Language- image pre-training with long captions. InECCV, 2024. 1, 3, 4, 5, 12

2024

Showing first 80 references.