arxiv: 2509.19602 · v2 · submitted 2025-09-23 · 💻 cs.CV

Parameter-Efficient Multi-Task Learning via Progressive Task-Specific Adaptation

Pith reviewed 2026-05-18 13:37 UTC · model grok-4.3

classification 💻 cs.CV
keywords parameter-efficient fine-tuning · multi-task learning · adapter modules · task similarity · vision transformers · negative transfer · progressive adaptation

The pith

Progressive task-specific adaptation shares adapter modules early and specializes them later to enable efficient multi-task learning with reduced interference.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a parameter-efficient approach for multi-task learning called progressive task-specific multi-task adaptation. Adapter modules are shared across tasks in the early layers of the model and become increasingly task-specific in the later layers. A gradient-based method computes task similarity to decide which tasks share the same adapter modules, aiming to group similar tasks together. This setup is applied to vision transformer models like Swin and Pyramid Vision Transformers for tasks on PASCAL and NYUD-v2 datasets. The method achieves superior performance compared to previous approaches while requiring fewer trainable parameters.

Core claim

By making adapter modules progressively more task-specific from early to late layers and using gradient-based similarity to allocate shared modules to similar tasks, the approach mitigates task interference and negative transfer in multi-task parameter-efficient fine-tuning, leading to better results with reduced parameter counts on semantic segmentation and depth estimation tasks.

What carries the argument

Progressive task-specific adaptation, where shared adapters in early layers transition to task-specific ones in deeper layers, combined with gradient-based task similarity for module allocation.
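
The shared-then-specific layout can be pictured as a sequence of task partitions, one per backbone stage, that only ever split groups as depth increases. The sketch below is a minimal illustration under stated assumptions: the task names T1 to T4 and the concrete groupings are hypothetical, and the one, one, two, four group counts simply mirror the stage configuration quoted in the supplementary excerpt at the end of this page. It is not the authors' implementation.

```python
from typing import List

Partition = List[List[str]]

def is_refinement(coarse: Partition, fine: Partition) -> bool:
    """True if every group in `fine` lies inside a single group of `coarse`."""
    return all(any(set(f) <= set(c) for c in coarse) for f in fine)

def adapter_path(task: str, schedule: List[Partition]) -> List[int]:
    """Index of the adapter group that `task` uses at each stage."""
    path = []
    for partition in schedule:
        for idx, group in enumerate(partition):
            if task in group:
                path.append(idx)
                break
        else:
            raise ValueError(f"{task} is missing from a stage partition")
    return path

# Hypothetical schedule: fully shared in stages 1-2, two groups in stage 3,
# one adapter per task in stage 4.
schedule: List[Partition] = [
    [["T1", "T2", "T3", "T4"]],
    [["T1", "T2", "T3", "T4"]],
    [["T1", "T2"], ["T3", "T4"]],
    [["T1"], ["T2"], ["T3"], ["T4"]],
]

# Progressive specialization: each stage may split groups but never merges them.
progressive = all(
    is_refinement(schedule[i], schedule[i + 1]) for i in range(len(schedule) - 1)
)
print(progressive, adapter_path("T3", schedule))
```

Checking the refinement property is one way to confirm that a schedule forms a tree, with tasks sharing a trunk and branching later.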

If this is right

  • Outperforms prior parameter-efficient multi-task methods on PASCAL and NYUD-v2 datasets.
  • Requires fewer trainable parameters than competing approaches.
  • Reduces task interference and negative transfer through similarity-based sharing of adapters.
  • Works effectively when applied to Swin and Pyramid Vision Transformers.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The gradient-based task similarity idea might transfer to other parameter-efficient methods such as LoRA or prefix tuning.
  • Early-layer sharing could help scale multi-task training to larger numbers of tasks without linear growth in parameters.
  • The progressive design might generalize to other backbone architectures beyond the vision transformers tested here.

Load-bearing premise

The assumption that gradient-based task similarity computation can reliably allocate similar tasks to shared adapter modules to reduce task interference and negative transfer.
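
To make the premise concrete, here is a small sketch of turning a pairwise similarity table into task groups. The similarity values are made up, and the selection rule (exhaustively pick the partition with the highest mean within-group similarity) is an assumption chosen for illustration; the page does not describe the paper's actual grouping procedure.

```python
from itertools import combinations
from typing import Dict, FrozenSet, Iterator, List

def partitions(items: List[str], k: int) -> Iterator[List[List[str]]]:
    """Yield every split of `items` into exactly k non-empty groups."""
    if len(items) < k or k < 1:
        return
    if k == 1:
        yield [list(items)]
        return
    if len(items) == k:
        yield [[x] for x in items]
        return
    first, rest = items[0], items[1:]
    for p in partitions(rest, k - 1):      # `first` forms its own group
        yield [[first]] + p
    for p in partitions(rest, k):          # `first` joins an existing group
        for i in range(len(p)):
            yield p[:i] + [[first] + p[i]] + p[i + 1:]

def score(partition: List[List[str]], sim: Dict[FrozenSet[str], float]) -> float:
    """Mean similarity over all task pairs that share a group."""
    pairs = [sim[frozenset(p)] for g in partition for p in combinations(g, 2)]
    return sum(pairs) / len(pairs) if pairs else 0.0

# Hypothetical pairwise similarities (not taken from the paper).
tasks = ["T1", "T2", "T3", "T4"]
sim = {
    frozenset(("T1", "T2")): 0.8, frozenset(("T3", "T4")): 0.7,
    frozenset(("T1", "T3")): 0.1, frozenset(("T1", "T4")): 0.2,
    frozenset(("T2", "T3")): 0.0, frozenset(("T2", "T4")): 0.1,
}

best = max(partitions(tasks, 2), key=lambda p: score(p, sim))
print(best)
```

Exhaustive search is only practical for a handful of tasks, which matches the small task counts of the benchmarks discussed here.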

What would settle it

If replacing the gradient-based task allocation with random grouping leads to similar or better performance on the tested datasets, or if the proposed method fails to show gains over baselines while using fewer parameters.
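
A random-grouping control of this kind is easy to set up. The sketch below, with hypothetical task names and helper name, draws a random assignment that keeps the same group sizes as the similarity-based grouping, so the number of adapters (and hence the parameter budget) stays fixed and only the assignment changes.

```python
import random
from typing import List

def random_grouping(tasks: List[str], group_sizes: List[int], seed: int) -> List[List[str]]:
    """Randomly assign tasks to groups of the given sizes (reproducible per seed)."""
    if sum(group_sizes) != len(tasks):
        raise ValueError("group sizes must add up to the number of tasks")
    rng = random.Random(seed)
    shuffled = list(tasks)
    rng.shuffle(shuffled)
    groups, start = [], 0
    for size in group_sizes:
        groups.append(sorted(shuffled[start:start + size]))
        start += size
    return groups

# Hypothetical tasks; the sizes mirror a two-group stage.
control_tasks = ["T1", "T2", "T3", "T4"]
controls = [random_grouping(control_tasks, [2, 2], seed) for seed in range(5)]
for grouping in controls:
    print(grouping)
```

Training one model per sampled grouping and comparing against the similarity-based grouping would show whether the allocation itself matters.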

Figures

Figures reproduced from arXiv: 2509.19602 by Anshuka Rangi, Holakou Rahmanian, Neeraj Gangwar, Nickvash Kani, Rishabh Deshmukh, Yesh Dattatreya.

Figure 1: Approaches to adapt a pre-trained model to perform …
Figure 2: Comparison of our proposed approach, progressive task …
Figure 3: (a) TGLoRA Layer with two task groups. T1, T2, and T3 represent three tasks. (b) TGLoRA is added to the attention and MLP layers of the Swin Transformer.
    (Text extracted with Figure 3:) where S and L(·) represent the similarity and loss functions. Following Achille et al. [2], we use the cosine similarity, indicated by S_cos, between the normalized gradients to compute the similarity as follows: S(g, g′) = S_cos( g / (|g| + |g′|), g′ / (|g…
Figure 4: (a) and (b) illustrate the task similarities and computed task groups for PASCAL, respectively, while (c) and (d) present the same …
Original abstract

Parameter-efficient fine-tuning methods have emerged as a promising solution for adapting pre-trained models to various downstream tasks. While these methods perform well in single-task learning, extending them to multi-task learning exacerbates common issues, such as task interference and negative transfer, due to the limited number of trainable parameters. To address these challenges, we introduce progressive task-specific multi-task adaptation, a novel parameter-efficient approach for multi-task learning. Our approach introduces adapter modules that are shared in early layers and become increasingly task-specific in later layers. Additionally, we propose a gradient-based approach for computing task similarity and use this measure to allocate similar tasks to the shared adapter modules. To evaluate our approach, we adapt Swin and Pyramid Vision Transformers on PASCAL and NYUD-v2. On both datasets, our approach outperforms prior parameter-efficient multi-task methods while using fewer trainable parameters.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces progressive task-specific multi-task adaptation for parameter-efficient fine-tuning of vision transformers. Adapter modules are shared across tasks in early layers and become progressively more task-specific in deeper layers; a gradient-based task similarity metric is used to allocate similar tasks to the same shared adapters. The method is evaluated by adapting Swin and Pyramid Vision Transformers to PASCAL and NYUD-v2, with the central claim that it outperforms prior parameter-efficient multi-task baselines while using fewer trainable parameters.

Significance. If the empirical claims hold after proper validation, the work would offer a practical advance in parameter-efficient multi-task learning by explicitly targeting task interference through a progressive sharing schedule and similarity-driven allocation. The design is novel relative to standard adapter or LoRA baselines and could influence efficient multi-task adaptation pipelines in computer vision.

major comments (3)
  1. [§3.3] §3.3 (Gradient-based Task Similarity): The central claim that gradient-based similarity reliably groups tasks to reduce negative transfer lacks isolating ablations. The reported gains could arise from the increased task-specific capacity in later layers rather than the similarity mechanism; without an ablation that disables the similarity allocation while keeping the progressive structure, the load-bearing assumption remains untested.
  2. [§4.1, Table 2] §4.1 and Table 2: The outperformance statement on PASCAL and NYUD-v2 is presented without error bars, multiple random seeds, or statistical significance tests. Given that the abstract already omits all quantitative numbers, the tables must demonstrate that the reported margins are robust and not sensitive to initialization or training order.
  3. [§3.2] §3.2 (Progressive Adapter Allocation): The description of how task similarity is computed from gradients during adaptation does not address potential dominance by high-magnitude tasks or sensitivity to the order in which tasks are presented; a concrete test (e.g., permuting task order or using different initializations) is needed to support the reliability claim.
minor comments (2)
  1. [§3] The notation for the task similarity score (presumably Eq. (3) or (4)) should be defined before its first use in the method section to avoid forward references.
  2. [Figure 2] Figure 2 (architecture diagram) would benefit from explicit labels indicating which layers are shared versus task-specific and how the gradient similarity is injected.
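
The robustness check requested in major comment 2 can be sketched with the standard library alone. The scores below are invented placeholders for three seeds, not results from the paper; the function computes a paired t statistic, which would then be compared against a critical value for the chosen significance level.

```python
import math
import statistics
from typing import Sequence

def paired_t_statistic(a: Sequence[float], b: Sequence[float]) -> float:
    """Paired t statistic for per-seed scores of two methods (same seeds, same order)."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equally long samples with at least two seeds")
    diffs = [x - y for x, y in zip(a, b)]
    sd = statistics.stdev(diffs)
    if sd == 0:
        raise ValueError("all per-seed differences are identical")
    return statistics.mean(diffs) / (sd / math.sqrt(len(diffs)))

# Made-up scores for three seeds (placeholders only).
proposed = [71.2, 71.5, 71.0]
baseline = [70.1, 70.6, 70.2]

print(f"proposed: {statistics.mean(proposed):.2f} +/- {statistics.stdev(proposed):.2f}")
print(f"baseline: {statistics.mean(baseline):.2f} +/- {statistics.stdev(baseline):.2f}")

t_stat = paired_t_statistic(proposed, baseline)
# With 3 seeds there are 2 degrees of freedom; the two-sided 5% critical
# value of the t distribution is about 4.303.
print(f"t = {t_stat:.2f}, significant at 5%: {abs(t_stat) > 4.303}")
```

With only three seeds the test has little power, so a small margin can fail to reach significance even when it is real.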

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback and detailed comments on our manuscript. We address each major point below and describe the revisions we will make to strengthen the empirical validation and clarity of our method.

Point-by-point responses
  1. Referee: [§3.3] §3.3 (Gradient-based Task Similarity): The central claim that gradient-based similarity reliably groups tasks to reduce negative transfer lacks isolating ablations. The reported gains could arise from the increased task-specific capacity in later layers rather than the similarity mechanism; without an ablation that disables the similarity allocation while keeping the progressive structure, the load-bearing assumption remains untested.

    Authors: We agree that an isolating ablation is necessary to separate the contribution of the similarity-based allocation from the progressive capacity increase. In the revised manuscript we will add a controlled ablation that retains the progressive sharing schedule but replaces the gradient-based allocation with random assignment of tasks to shared adapters. We will report the resulting performance drop on both PASCAL and NYUD-v2 to quantify the benefit attributable to the similarity mechanism. revision: yes

  2. Referee: [§4.1, Table 2] §4.1 and Table 2: The outperformance statement on PASCAL and NYUD-v2 is presented without error bars, multiple random seeds, or statistical significance tests. Given that the abstract already omits all quantitative numbers, the tables must demonstrate that the reported margins are robust and not sensitive to initialization or training order.

    Authors: We acknowledge the importance of statistical robustness. In the revision we will rerun all main experiments with three independent random seeds, report mean and standard deviation in Table 2, and include paired t-tests or Wilcoxon tests to establish statistical significance of the observed improvements over the strongest baselines. revision: yes

  3. Referee: [§3.2] §3.2 (Progressive Adapter Allocation): The description of how task similarity is computed from gradients during adaptation does not address potential dominance by high-magnitude tasks or sensitivity to the order in which tasks are presented; a concrete test (e.g., permuting task order or using different initializations) is needed to support the reliability claim.

    Authors: We will expand §3.2 to clarify that gradient similarities are computed after per-task gradient normalization (dividing by the L2 norm of each task’s gradient) to reduce dominance by high-magnitude tasks. To demonstrate stability, we will add a new experiment that permutes task presentation order and repeats the similarity computation under two different adapter initializations, reporting both the resulting task groupings and downstream multi-task performance. revision: yes
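
The normalization described in the third response can be illustrated as follows. This is a simplified sketch: it uses plain L2 normalization followed by cosine similarity, as the response states, rather than the (truncated) formula from the paper, and the gradient vectors are hypothetical. It shows that rescaling one task's gradient or reordering the tasks does not change the similarity table; it says nothing about sensitivity to initialization, which needs the experiments the response promises.

```python
import math
from itertools import combinations
from typing import Dict, FrozenSet, List

def l2_normalize(g: List[float]) -> List[float]:
    """Scale a gradient vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in g))
    if norm == 0:
        raise ValueError("zero gradient cannot be normalized")
    return [x / norm for x in g]

def cosine(g1: List[float], g2: List[float]) -> float:
    """Cosine similarity between two gradients after per-task normalization."""
    u, v = l2_normalize(g1), l2_normalize(g2)
    return sum(a * b for a, b in zip(u, v))

def similarity_table(grads: Dict[str, List[float]]) -> Dict[FrozenSet[str], float]:
    """Pairwise similarities keyed by unordered task pair."""
    return {
        frozenset((a, b)): cosine(grads[a], grads[b])
        for a, b in combinations(sorted(grads), 2)
    }

# Hypothetical flattened gradients for three tasks.
grads = {"T1": [1.0, 0.0, 1.0], "T2": [2.0, 0.1, 1.9], "T3": [-1.0, 1.0, 0.0]}
base = similarity_table(grads)

# Blow up one task's gradient magnitude and present the tasks in another order.
scaled = dict(grads, T2=[1000.0 * x for x in grads["T2"]])
reordered = {name: scaled[name] for name in ["T3", "T2", "T1"]}
perturbed = similarity_table(reordered)

max_change = max(abs(base[k] - perturbed[k]) for k in base)
print(max_change)
```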

Circularity Check

0 steps flagged

No circularity: novel progressive adaptation and gradient-based allocation evaluated empirically

Full rationale

The paper proposes a new parameter-efficient multi-task method using progressively task-specific adapters (shared early, specific later) plus a gradient-based task similarity measure to group tasks. These design choices are presented as original contributions and validated through experiments on PASCAL and NYUD-v2 with Swin/PVT backbones, claiming fewer parameters and better performance than prior methods. No equations, self-citations, or fitted inputs are shown reducing the central claims to definitions or prior results by construction. The derivation chain consists of architectural decisions and empirical testing rather than tautological renaming or self-referential fitting.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

Based on abstract only; the approach rests on the domain assumption that early network layers capture shareable general features while later layers require task specificity, with no free parameters or invented entities explicitly quantified.

axioms (1)
  • domain assumption: Task interference and negative transfer are exacerbated in multi-task learning due to the limited number of trainable parameters.
    Directly stated as the core challenge motivating the progressive adaptation approach.
invented entities (1)
  • progressive task-specific adapter modules (no independent evidence)
    purpose: Enable sharing in early layers and increasing task specificity in later layers for multi-task adaptation.
    Newly introduced mechanism without mention of prior independent evidence or falsifiable predictions outside the paper.


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Parameter-efficient Quantum Multi-task Learning

    cs.LG · 2026-04 · unverdicted · novelty 6.0

    QMTL uses shared VQC encoding plus task-specific quantum ansatz heads to achieve linear parameter scaling with the number of tasks while matching or exceeding classical multi-task baselines on three benchmarks.

Reference graph

Works this paper leans on

53 extracted references · 53 canonical work pages · cited by 1 Pith paper · 4 internal anchors

  1. [1]

    GPT-4 Technical Report

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774,

  2. [2]

    Task2vec: Task embedding for meta-learning

    Alessandro Achille, Michael Lam, Rahul Tewari, Avinash Ravichandran, Subhransu Maji, Charless C Fowlkes, Stefano Soatto, and Pietro Perona. Task2vec: Task embedding for meta-learning. In Proceedings of the IEEE/CVF international conference on computer vision, pages 6430–6439,

  3. [3]

    MTLoRA: Low-rank adaptation approach for efficient multi-task learning

    Ahmed Agiza, Marina Neseem, and Sherief Reda. MTLoRA: Low-rank adaptation approach for efficient multi-task learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16196–16205, 2024.

  4. [4]

    Sequential modeling enables scalable learning for large vision models

    Yutong Bai, Xinyang Geng, Karttikeya Mangalam, Amir Bar, Alan L Yuille, Trevor Darrell, Jitendra Malik, and Alexei A Efros. Sequential modeling enables scalable learning for large vision models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22861–22872, 2024.

  5. [5]

    Fair resource allocation in multi-task learning

    Hao Ban and Kaiyi Ji. Fair resource allocation in multi-task learning. In Forty-first International Conference on Machine Learning, 2024.

  6. [6]

    Automated search for resource-efficient branched multi-task networks

    David Brüggemann, Menelaos Kanakis, Stamatios Georgoulis, and Luc Van Gool. Automated search for resource-efficient branched multi-task networks. In 31st British Machine Vision Conference 2020, BMVC 2020, page 359. BMVA Press, 2020.

  7. [7]

    Adaptformer: Adapting vision transformers for scalable visual recognition

    Shoufa Chen, Chongjian Ge, Zhan Tong, Jiangliu Wang, Yibing Song, Jue Wang, and Ping Luo. Adaptformer: Adapting vision transformers for scalable visual recognition. Advances in Neural Information Processing Systems, 35:16664–16678, 2022.

  8. [8]

    Detect what you can: Detecting and representing objects using holistic models and body parts

    Xianjie Chen, Roozbeh Mottaghi, Xiaobai Liu, Sanja Fidler, Raquel Urtasun, and Alan Yuille. Detect what you can: Detecting and representing objects using holistic models and body parts. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1971–1978,

  9. [9]

    Gradnorm: Gradient normalization for adaptive loss balancing in deep multitask networks

    Zhao Chen, Vijay Badrinarayanan, Chen-Yu Lee, and Andrew Rabinovich. Gradnorm: Gradient normalization for adaptive loss balancing in deep multitask networks. In International conference on machine learning, pages 794–803. PMLR, 2018.

  10. [10]

    Vision transformer adapter for dense predictions

    Zhe Chen, Yuchen Duan, Wenhai Wang, Junjun He, Tong Lu, Jifeng Dai, and Yu Qiao. Vision transformer adapter for dense predictions. In The Eleventh International Conference on Learning Representations, 2023.

  11. [11]

    Multi-task learning with deep neural networks: A survey

    Michael Crawshaw. Multi-task learning with deep neural networks: A survey. arXiv preprint arXiv:2009.09796, 2020.

  12. [12]

    The pascal visual object classes (voc) challenge

    Mark Everingham, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew Zisserman. The pascal visual object classes (voc) challenge. International Journal of Computer Vision, 88:303–338, 2010.

  13. [13]

    Efficiently identifying task groupings for multi-task learning

    Chris Fifty, Ehsan Amid, Zhe Zhao, Tianhe Yu, Rohan Anil, and Chelsea Finn. Efficiently identifying task groupings for multi-task learning. Advances in Neural Information Processing Systems, 34:27503–27516, 2021.

  14. [14]

    LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model

    Peng Gao, Jiaming Han, Renrui Zhang, Ziyi Lin, Shijie Geng, Aojun Zhou, Wei Zhang, Pan Lu, Conghui He, Xiangyu Yue, et al. Llama-adapter v2: Parameter-efficient visual instruction model. arXiv preprint arXiv:2304.15010,

  15. [15]

    Learning to branch for multi-task learning

    Pengsheng Guo, Chen-Yu Lee, and Daniel Ulbricht. Learning to branch for multi-task learning. In International conference on machine learning, pages 3854–3863. PMLR, 2020.

  16. [16]

    Lora+: Efficient low rank adaptation of large models

    Soufiane Hayou, Nikhil Ghosh, and Bin Yu. Lora+: Efficient low rank adaptation of large models. In Forty-first International Conference on Machine Learning, 2024.

  17. [17]

    Parameter-efficient transfer learning for nlp

    Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. Parameter-efficient transfer learning for nlp. In International conference on machine learning, pages 2790–2799. PMLR, 2019.

  18. [18]

    LoRA: Low-rank adaptation of large language models

    Edward J Hu, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations, 2022.

  19. [19]

    Llm-adapters: An adapter family for parameter-efficient fine-tuning of large language models

    Zhiqiang Hu, Lei Wang, Yihuai Lan, Wanyu Xu, Ee-Peng Lim, Lidong Bing, Xing Xu, Soujanya Poria, and Roy Lee. Llm-adapters: An adapter family for parameter-efficient fine-tuning of large language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 5254–5276, 2023.

  20. [20]

    Visual prompt tuning

    Menglin Jia, Luming Tang, Bor-Chun Chen, Claire Cardie, Serge Belongie, Bharath Hariharan, and Ser-Nam Lim. Visual prompt tuning. In European Conference on Computer Vision, pages 709–727. Springer, 2022.

  21. [21]

    Multi-task learning using uncertainty to weigh losses for scene geometry and semantics

    Alex Kendall, Yarin Gal, and Roberto Cipolla. Multi-task learning using uncertainty to weigh losses for scene geometry and semantics. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7482–7491,

  22. [22]

    The power of scale for parameter-efficient prompt tuning

    Brian Lester, Rami Al-Rfou, and Noah Constant. The power of scale for parameter-efficient prompt tuning. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 3045–3059, 2021.

  23. [23]

    Prefix-tuning: Optimizing continuous prompts for generation

    Xiang Lisa Li and Percy Liang. Prefix-tuning: Optimizing continuous prompts for generation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 4582–4597, 2021.

  24. [24]

    Lora dropout as a sparsity regularizer for overfitting control

    Yang Lin, Xinyu Ma, Xu Chu, Yujie Jin, Zhibang Yang, Yasha Wang, and Hong Mei. Lora dropout as a sparsity regularizer for overfitting control. arXiv preprint arXiv:2404.09610, 2024.

  25. [25]

    Dora: Weight-decomposed low-rank adaptation

    Shih-yang Liu, Chien-Yi Wang, Hongxu Yin, Pavlo Molchanov, Yu-Chiang Frank Wang, Kwang-Ting Cheng, and Min-Hung Chen. Dora: Weight-decomposed low-rank adaptation. In Forty-first International Conference on Machine Learning, 2024.

  26. [26]

    Polyhistor: Parameter-efficient multi-task adaptation for dense vision tasks

    Yen-Cheng Liu, Chih-Yao Ma, Junjiao Tian, Zijian He, and Zsolt Kira. Polyhistor: Parameter-efficient multi-task adaptation for dense vision tasks. Advances in Neural Information Processing Systems, 35:36889–36901, 2022.

  27. [27]

    Swin transformer: Hierarchical vision transformer using shifted windows

    Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10012–10022, 2021.

  28. [28]

    Decoupled weight decay regularization

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations, 2017.

  29. [29]

    Yongxi Lu, Abhishek Kumar, Shuangfei Zhai, Yu Cheng, Tara Javidi, and Rogério Schmidt Feris. Fully-adaptive feature sharing in multi-task networks with applications in person attribute classification. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1131–1140, 2016.

  30. [30]

    Parameter-efficient multi-task fine-tuning for transformers via shared hypernetworks

    Rabeeh Karimi Mahabadi, Sebastian Ruder, Mostafa Dehghani, and James Henderson. Parameter-efficient multi-task fine-tuning for transformers via shared hypernetworks. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)...

  31. [31]

    Attentive single-tasking of multiple tasks

    Kevis-Kokitsi Maninis, Ilija Radosavovic, and Iasonas Kokkinos. Attentive single-tasking of multiple tasks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 1851–1860, 2019.

  32. [32]

    Adapterhub: A framework for adapting transformers

    Jonas Pfeiffer, Andreas Rücklé, Clifton Poth, Aishwarya Kamath, Ivan Vulic, Sebastian Ruder, Kyunghyun Cho, and Iryna Gurevych. Adapterhub: A framework for adapting transformers. EMNLP 2020, page 46, 2020.

  33. [33]

    Adapterfusion: Non-destructive task composition for transfer learning

    Jonas Pfeiffer, Aishwarya Kamath, Andreas Rücklé, Kyunghyun Cho, and Iryna Gurevych. Adapterfusion: Non-destructive task composition for transfer learning. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 487–503, 2021.

  34. [34]

    Independent component alignment for multi-task learning

    Dmitry Senushkin, Nikolay Patakin, Arseny Kuznetsov, and Anton Konushin. Independent component alignment for multi-task learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20083–20093, 2023.

  35. [35]

    Indoor segmentation and support inference from rgbd images

    Nathan Silberman, Derek Hoiem, Pushmeet Kohli, and Rob Fergus. Indoor segmentation and support inference from rgbd images. In Computer Vision–ECCV 2012: 12th European Conference on Computer Vision, Florence, Italy, October 7-13, 2012, Proceedings, Part V 12, pages 746–760. Springer, 2012.

  36. [36]

    Training neural networks with fixed sparse masks

    Yi-Lin Sung, Varun Nair, and Colin A Raffel. Training neural networks with fixed sparse masks. Advances in Neural Information Processing Systems, 34:24193–24205, 2021.

  37. [37]

    Gemini: A Family of Highly Capable Multimodal Models

    Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023.

  38. [38]

    Llama 2: Open Foundation and Fine-Tuned Chat Models

    Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.

  39. [39]

    Branched multi-task networks: Deciding what layers to share

    Simon Vandenhende, Stamatios Georgoulis, Bert De Brabandere, and Luc Van Gool. Branched multi-task networks: Deciding what layers to share. Proceedings BMVC 2020,

  40. [40]

    Mti-net: Multi-scale task interaction networks for multi-task learning

    Simon Vandenhende, Stamatios Georgoulis, and Luc Van Gool. Mti-net: Multi-scale task interaction networks for multi-task learning. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part IV 16, pages 527–543. Springer, 2020.

  41. [41]

    Multi-task learning for dense prediction tasks: A survey

    Simon Vandenhende, Stamatios Georgoulis, Wouter Van Gansbeke, Marc Proesmans, Dengxin Dai, and Luc Van Gool. Multi-task learning for dense prediction tasks: A survey. IEEE transactions on pattern analysis and machine intelligence, 44(7):3614–3633, 2021.

  42. [42]

    Deep high-resolution representation learning for visual recognition

    Jingdong Wang, Ke Sun, Tianheng Cheng, Borui Jiang, Chaorui Deng, Yang Zhao, Dong Liu, Yadong Mu, Mingkui Tan, Xinggang Wang, et al. Deep high-resolution representation learning for visual recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(10):3349–3364, 2020.

  43. [43]

    Adamix: Mixture-of-adaptations for parameter-efficient model tuning

    Yaqing Wang, Sahaj Agarwal, Subhabrata Mukherjee, Xiaodong Liu, Jing Gao, Ahmed Hassan, and Jianfeng Gao. Adamix: Mixture-of-adaptations for parameter-efficient model tuning. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 5744–5760, 2022.

  44. [44]

    Parameter-efficient fine-tuning for pre-trained vision models: A survey

    Yi Xin, Siqi Luo, Haodi Zhou, Junlong Du, Xiaohong Liu, Yue Fan, Qing Li, and Yuntao Du. Parameter-efficient fine-tuning for pre-trained vision models: A survey. arXiv preprint arXiv:2402.02242, 2024.

  45. [45]

    Pad-net: Multi-tasks guided prediction-and-distillation network for simultaneous depth estimation and scene parsing

    Dan Xu, Wanli Ouyang, Xiaogang Wang, and Nicu Sebe. Pad-net: Multi-tasks guided prediction-and-distillation network for simultaneous depth estimation and scene parsing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 675–684, 2018.

  46. [46]

    Inverted pyramid multi-task transformer for dense scene understanding

    Hanrong Ye and Dan Xu. Inverted pyramid multi-task transformer for dense scene understanding. In European Conference on Computer Vision, pages 514–530. Springer, 2022.

  47. [47]

    Unleashing the power of multi-task learning: A comprehensive survey spanning traditional, deep, and pretrained foundation model eras

    Jun Yu, Yutong Dai, Xiaokang Liu, Jin Huang, Yishan Shen, Ke Zhang, Rong Zhou, Eashan Adhikarla, Wenxuan Ye, Yixin Liu, et al. Unleashing the power of multi-task learning: A comprehensive survey spanning traditional, deep, and pretrained foundation model eras. arXiv preprint arXiv:2404.18961, 2024.

  48. [48]

    Gradient surgery for multi-task learning

    Tianhe Yu, Saurabh Kumar, Abhishek Gupta, Sergey Levine, Karol Hausman, and Chelsea Finn. Gradient surgery for multi-task learning. Advances in Neural Information Processing Systems, 33:5824–5836, 2020.

  49. [49]

    Bitfit: Simple parameter-efficient fine-tuning for transformer-based masked language-models

    Elad Ben Zaken, Yoav Goldberg, and Shauli Ravfogel. Bitfit: Simple parameter-efficient fine-tuning for transformer-based masked language-models. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 1–9, 2022.

  50. [50]

    Adaptive budget allocation for parameter-efficient fine-tuning

    Qingru Zhang, Minshuo Chen, Alexander Bukharin, Pengcheng He, Yu Cheng, Weizhu Chen, and Tuo Zhao. Adaptive budget allocation for parameter-efficient fine-tuning. In The Eleventh International Conference on Learning Representations, 2023.

    Parameter-Efficient Multi-Task Learning via Progressive Task-Specific Adaptation: Supplementary Material

  51. [51]

    Training Hyperparameters

    PASCAL. We replicate the hyperparameters from Agiza et al. [3] for fine-tuning the models on the PASCAL dataset. Specifically, we use the AdamW optimizer [28] with a batch size of 32, a learning rate of 3.125 × 10^-5, and a weight decay of 0.05. The models are fine-tuned for 300 epochs, with evaluations every 20 epochs. We use a li...

  52. [52]

    Additional Results

    In this section, we present additional experiments on the PASCAL and NYUD-v2 datasets. Continuing from Section 4.5, Table 5 illustrates the performance of TGLoRA for varying trainable parameters. The rank of the low-rank modules in TGLoRA layers controls this number. The table also shows the performance of “Single Task – LoRA” and “...

  53. [53]

    Computational Budget vs Performance. The tree structure offers a trade-off between model performance and inference cost. For instance, using 6.89M parameters for PASCAL-Context, and assigning one, one, two, and four task groups to the first through fourth stages, respectively, results in Δm = +3.93% with 38.37 GMacs. Similarly, configuring the stages with o...
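
The supplementary excerpts above say that the rank of the low-rank modules controls the trainable-parameter count and that the number of task groups per stage shapes the budget. The sketch below shows how those two knobs interact under simplifying assumptions: a standard LoRA pair adds `rank * (d_in + d_out)` parameters to a linear layer, each task group is assumed to own one such pair per adapted layer, and the stage widths and layer counts are placeholders rather than the paper's configuration. It will not reproduce the 6.89M figure quoted above.

```python
from typing import Sequence

def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters of one low-rank pair: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)

def total_adapter_params(
    stage_dims: Sequence[int],
    groups_per_stage: Sequence[int],
    layers_per_stage: Sequence[int],
    rank: int,
) -> int:
    """Total low-rank parameters, assuming one pair per task group per adapted layer."""
    total = 0
    for dim, groups, layers in zip(stage_dims, groups_per_stage, layers_per_stage):
        total += groups * layers * lora_params(dim, dim, rank)
    return total

# Placeholder backbone shape (assumed, one square linear layer per block).
stage_dims = [96, 192, 384, 768]
layers_per_stage = [2, 2, 6, 2]
# One, one, two, and four task groups in stages one through four.
groups_per_stage = [1, 1, 2, 4]

for rank in (4, 8, 16):
    count = total_adapter_params(stage_dims, groups_per_stage, layers_per_stage, rank)
    print(rank, count)
```

Under these assumptions the count grows linearly with rank, and the later, more task-specific stages dominate the budget because they multiply wide layers by the largest number of groups.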