pith · machine review for the scientific record

arxiv: 2604.09088 · v1 · submitted 2026-04-10 · 💻 cs.CV

Recognition: 2 theorem links

· Lean Theorem

Memory-Efficient Transfer Learning with Fading Side Networks via Masked Dual Path Distillation

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 16:40 UTC · model grok-4.3

classification 💻 cs.CV
keywords memory-efficient transfer learning · side networks · knowledge distillation · inference acceleration · fine-tuning · masked dual path distillation · computer vision · vision-language models

The pith

Mutual distillation during fine-tuning lets side networks be discarded at inference with no accuracy loss.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Masked Dual Path Distillation to fix a drawback in memory-efficient transfer learning. Prior methods use a lightweight side network to adapt large frozen backbones with low training memory, but this adds inference overhead. The new approach runs mutual distillation between the backbone and side network during fine-tuning so that the side network's knowledge is absorbed into the backbone. After training, the side network can be removed entirely. A sympathetic reader would care because the result combines low-memory adaptation with fast, memory-light deployment on standard hardware.
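As a concreteness aid, here is a minimal PyTorch-style sketch of what one dual-path training step could look like. The module names (`backbone`, `side_net`, `head`), the KL-based mutual distillation, and the loss weighting are illustrative assumptions, not the paper's exact formulation; in particular, the small trainable head on the backbone path is a guess about how the backbone-only path is adapted while the backbone itself stays frozen.

```python
import torch
import torch.nn.functional as F

def train_step(backbone, side_net, head, x, y, optimizer, tau=2.0, alpha=0.5):
    """One dual-path training step (illustrative sketch, not the paper's code)."""
    with torch.no_grad():                  # the large backbone stays frozen
        feats = backbone(x)
    logits_main = head(feats)              # backbone path + small trainable head (assumed)
    logits_side = side_net(feats)          # lightweight learnable side path

    # Supervised task loss on both paths.
    loss_task = F.cross_entropy(logits_main, y) + F.cross_entropy(logits_side, y)

    # Mutual (bidirectional) distillation between the two paths' predictions.
    log_p_main = F.log_softmax(logits_main / tau, dim=-1)
    log_p_side = F.log_softmax(logits_side / tau, dim=-1)
    loss_kd = F.kl_div(log_p_main, log_p_side.exp().detach(), reduction="batchmean") \
            + F.kl_div(log_p_side, log_p_main.exp().detach(), reduction="batchmean")

    loss = loss_task + alpha * (tau ** 2) * loss_kd
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# At deployment the side path is simply dropped:
#   logits = head(backbone(x))    # no side_net, no extra inference cost
```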

Core claim

The paper claims that mutual distillation between a frozen backbone and a learnable side network, combined with a masked dual-path framework and feature-based knowledge distillation for multi-layer encoders, transfers enough knowledge that the side network can be faded away after fine-tuning without any accuracy drop on downstream tasks.

What carries the argument

Masked Dual Path Distillation (MDPD): a training-only framework that runs bidirectional knowledge transfer between the frozen backbone and side network so the side network can be removed at inference while preserving performance.
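To make the feature-level machinery concrete, the sketch below shows one plausible form of a masked, layer-wise feature distillation loss for multi-layer encoders. The random token masking, the per-layer projections, and the choice of detached backbone features as targets are editorial assumptions for illustration, not details confirmed by the paper.

```python
import torch
import torch.nn.functional as F

def masked_feature_distill_loss(backbone_feats, side_feats, projs, mask_ratio=0.5):
    """Layer-wise feature distillation on a random subset of token positions.

    backbone_feats, side_feats: lists of [B, N, D] tensors, one per encoder layer
    projs: per-layer modules aligning side-path features to the backbone width
    """
    loss = 0.0
    for f_b, f_s, proj in zip(backbone_feats, side_feats, projs):
        B, N, _ = f_b.shape
        keep = (torch.rand(B, N, device=f_b.device) < mask_ratio).float()  # token mask
        f_s = proj(f_s)                                                    # match dims
        per_token = F.mse_loss(f_s, f_b.detach(), reduction="none").mean(dim=-1)
        loss = loss + (per_token * keep).sum() / keep.sum().clamp(min=1.0)
    return loss / max(len(backbone_feats), 1)
```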

If this is right

  • Inference runs at least 25.2% faster while parameter count and memory use during fine-tuning stay comparable to existing memory-efficient methods.
  • Accuracy rises above state-of-the-art memory-efficient transfer learning baselines on vision-only, language-only, and vision-language tasks.
  • The same side-network approach works across multiple backbone architectures without extra inference cost.
  • Side networks become training-only components that disappear at deployment.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same mutual-distillation pattern could be tested on other training-only auxiliaries such as adapters or prompt modules to see if they too can be removed post-training.
  • Models trained this way would require less specialized inference hardware, since only the original backbone remains.
  • Extending the masked feature distillation to decoder-only or hybrid architectures would be a direct next measurement.

Load-bearing premise

Mutual distillation during fine-tuning transfers enough knowledge from the side network into the frozen backbone so that removing the side network at inference causes no accuracy drop.

What would settle it

Measure accuracy of the distilled backbone alone versus the full backbone-plus-side-network system on a standard benchmark after MDPD training; a clear drop when the side network is removed would falsify the central claim.
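A minimal evaluation sketch of that measurement, assuming a standard classification benchmark and access to both paths after MDPD training; the `accuracy` helper, the `test_loader`, and the fusion rule for the dual-path prediction are placeholders, not the paper's protocol.

```python
import torch

@torch.no_grad()
def accuracy(forward_fn, loader, device="cuda"):
    correct = total = 0
    for x, y in loader:
        x, y = x.to(device), y.to(device)
        pred = forward_fn(x).argmax(dim=-1)
        correct += (pred == y).sum().item()
        total += y.numel()
    return correct / total

# Backbone-only path (what would ship at deployment) versus the full
# dual-path system used during training. A clear gap between the two
# would falsify the "fading" claim.
acc_backbone_only = accuracy(lambda x: head(backbone(x)), test_loader)
acc_with_side = accuracy(lambda x: head(backbone(x)) + side_net(backbone(x)),
                         test_loader)  # the fusion rule here is a placeholder
print(f"backbone only: {acc_backbone_only:.4f} | with side net: {acc_with_side:.4f}")
```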

Figures

Figures reproduced from arXiv: 2604.09088 by Hanwen Zhong, Honglin Chen, Jiaxin Chen, Kaiqi Zheng, Shengcai Liao, Weixin Li, Yunhong Wang, Yutong Zhang.

Figure 1. Illustration of the method. Part (I) shows the pipeline of Masked Dual Path Distillation during training.
Original abstract

Memory-efficient transfer learning (METL) approaches have recently achieved promising performance in adapting pre-trained models to downstream tasks. They avoid applying gradient backpropagation in large backbones, thus significantly reducing the number of trainable parameters and high memory consumption during fine-tuning. However, since they typically employ a lightweight and learnable side network, these methods inevitably introduce additional memory and time overhead during inference, which contradicts the ultimate goal of efficient transfer learning. To address the above issue, we propose a novel approach dubbed Masked Dual Path Distillation (MDPD) to accelerate inference while retaining parameter and memory efficiency in fine-tuning with fading side networks. Specifically, MDPD develops a framework that enhances the performance by mutually distilling the frozen backbones and learnable side networks in fine-tuning, and discard the side network during inference without sacrificing accuracy. Moreover, we design a novel feature-based knowledge distillation method for the encoder structure with multiple layers. Extensive experiments on distinct backbones across vision/language-only and vision-and-language tasks demonstrate that our method not only accelerates inference by at least 25.2\% while keeping parameter and memory consumption comparable, but also remarkably promotes the accuracy compared to SOTA approaches. The source code is available at https://github.com/Zhang-VKk/MDPD.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 3 minor

Summary. The paper proposes Masked Dual Path Distillation (MDPD) to improve memory-efficient transfer learning (METL). A frozen pre-trained backbone is paired with a lightweight learnable side network during fine-tuning; mutual distillation (including a novel feature-based method for multi-layer encoders) with masking aligns the two paths so that the side network can be discarded at inference. This is claimed to yield at least 25.2% faster inference with comparable parameter/memory cost during training and no accuracy drop (in fact, accuracy gains versus SOTA). Experiments span vision-only, language-only, and vision-language tasks with multiple backbones; source code is released.

Significance. If the central empirical claim holds, the work directly solves the inference overhead that has limited prior side-network METL methods, delivering both training-time memory savings and deployment-time speed without accuracy trade-offs. The reproducible code and multi-task evaluation are positive factors for adoption in efficient adaptation pipelines.

major comments (3)
  1. [§3] §3 (MDPD framework): the mutual-distillation description does not explain how task-specific refinements learned by the side network are transferred into the frozen backbone. Because the backbone receives no gradients, any feature-based or output-based distillation loss can only update the side network; it is unclear by what mechanism the backbone-alone inference accuracy equals the backbone+side accuracy observed during training.
  2. [§4] §4 (Experiments): the abstract asserts 'without sacrificing accuracy' and 'remarkably promotes the accuracy,' yet the reported tables appear to compare only against external SOTA baselines. Explicit ablations are needed showing (i) backbone-only accuracy after MDPD training versus backbone+side-network accuracy during training, and (ii) versus standard full fine-tuning of the backbone, with statistical significance and the same data splits.
  3. [§3.2] §3.2 (masked dual-path feature distillation): the masking schedule and the precise form of the layer-wise feature loss (Eq. (X)) are not shown to guarantee that the fixed backbone representations become sufficient for the downstream task once the side path is removed; a counter-example or failure case where alignment fails would strengthen the claim.
minor comments (3)
  1. [Abstract] Abstract: the speedup is stated as 'at least 25.2%'; reporting the per-backbone and per-task range (or mean ± std) would make the claim more precise.
  2. [Title / §1] Notation: the distinction between 'fading side networks' in the title and the 'discard the side network' procedure in the text should be unified for clarity.
  3. [§4.1] Implementation details: hyper-parameters for the distillation temperature, masking ratio, and loss weighting are referenced but not tabulated; a supplementary table would aid reproduction.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and indicate where revisions to the manuscript will be made.

Point-by-point responses
  1. Referee: [§3] §3 (MDPD framework): the mutual-distillation description does not explain how task-specific refinements learned by the side network are transferred into the frozen backbone. Because the backbone receives no gradients, any feature-based or output-based distillation loss can only update the side network; it is unclear by what mechanism the backbone-alone inference accuracy equals the backbone+side accuracy observed during training.

    Authors: We agree that the current description in §3 lacks sufficient detail on the transfer mechanism. The backbone remains frozen throughout training, so gradients from the distillation losses update only the side network. The design relies on the masked dual-path alignment to make the side network's contributions redundant at inference time, allowing the fixed backbone to achieve the reported accuracy. We will revise §3 to explicitly articulate this alignment process and its implications for discarding the side network. revision: yes

  2. Referee: [§4] §4 (Experiments): the abstract asserts 'without sacrificing accuracy' and 'remarkably promotes the accuracy,' yet the reported tables appear to compare only against external SOTA baselines. Explicit ablations are needed showing (i) backbone-only accuracy after MDPD training versus backbone+side-network accuracy during training, and (ii) versus standard full fine-tuning of the backbone, with statistical significance and the same data splits.

    Authors: The referee correctly notes that the existing tables focus on external baselines. We will add the requested ablations to the revised §4, including direct comparisons of backbone-only performance after MDPD training against the joint backbone+side accuracy, as well as against standard full fine-tuning. These will use the same data splits and report means with standard deviations over multiple runs to assess statistical significance. revision: yes

  3. Referee: [§3.2] §3.2 (masked dual-path feature distillation): the masking schedule and the precise form of the layer-wise feature loss (Eq. (X)) are not shown to guarantee that the fixed backbone representations become sufficient for the downstream task once the side path is removed; a counter-example or failure case where alignment fails would strengthen the claim.

    Authors: We acknowledge that additional analysis would help substantiate the guarantee. In the revision we will expand §3.2 with a clearer derivation of the masking schedule and layer-wise loss, together with empirical checks across the evaluated tasks. We will also discuss observed edge cases where alignment is weaker, though we do not currently have a constructed counter-example that demonstrates outright failure. revision: partial

Circularity Check

0 steps flagged

No significant circularity in empirical distillation method

full rationale

The paper proposes an empirical technique (MDPD) that applies standard feature-based mutual distillation losses between a frozen backbone and a learnable side network during fine-tuning, then discards the side network at inference. All performance claims (25.2% inference speedup, accuracy gains) rest on experimental results across vision, language, and vision-language tasks rather than any closed-form derivation, parameter fit renamed as prediction, or self-referential equation. No load-bearing self-citations, uniqueness theorems, or ansatzes imported from prior author work appear in the provided text; the central premise is tested directly via ablation and comparison to SOTA baselines and is therefore self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The method rests on standard assumptions of knowledge distillation and transfer learning; no new physical entities or ad-hoc constants are introduced beyond typical hyperparameters.

axioms (1)
  • domain assumption: Knowledge distillation between the side network and the backbone can embed the side network's learned features into the frozen backbone sufficiently for inference without the side network.
    Invoked in the description of mutual distillation and discarding the side network.

pith-pipeline@v0.9.0 · 5551 in / 1242 out tokens · 36808 ms · 2026-05-10T16:40:36.563527+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
