Parameter-Efficient Multi-Task Learning via Progressive Task-Specific Adaptation

Anshuka Rangi; Holakou Rahmanian; Neeraj Gangwar; Nickvash Kani; Rishabh Deshmukh; Yesh Dattatreya

arxiv: 2509.19602 · v2 · submitted 2025-09-23 · 💻 cs.CV

Neeraj Gangwar, Anshuka Rangi, Rishabh Deshmukh, Holakou Rahmanian, Yesh Dattatreya, Nickvash Kani

Pith reviewed 2026-05-18 13:37 UTC · model grok-4.3

classification 💻 cs.CV

keywords parameter-efficient fine-tuning · multi-task learning · adapter modules · task similarity · vision transformers · negative transfer · progressive adaptation

The pith

Progressive task-specific adaptation shares adapter modules early and specializes them later to enable efficient multi-task learning with reduced interference.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a parameter-efficient approach for multi-task learning called progressive task-specific multi-task adaptation. Adapter modules are shared across tasks in the early layers of the model and become increasingly task-specific in the later layers. A gradient-based method computes task similarity to decide which tasks share the same adapter modules, aiming to group similar tasks together. This setup is applied to vision transformer models like Swin and Pyramid Vision Transformers for tasks on PASCAL and NYUD-v2 datasets. The method achieves superior performance compared to previous approaches while requiring fewer trainable parameters.

Core claim

By making adapter modules progressively more task-specific from early to late layers and using gradient-based similarity to allocate shared modules to similar tasks, the approach mitigates task interference and negative transfer in multi-task parameter-efficient fine-tuning, leading to better results with reduced parameter counts on semantic segmentation and depth estimation tasks.

What carries the argument

Progressive task-specific adaptation, where shared adapters in early layers transition to task-specific ones in deeper layers, combined with gradient-based task similarity for module allocation.
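
One way to picture the shared-to-specific transition is as a sequence of task partitions, one per backbone stage, where each partition only ever splits the groups of the stage before it. The sketch below illustrates that reading and is not the authors' implementation: the task names T1 to T3, the three-stage schedule, and the helpers `is_refinement`, `validate_schedule`, and `adapter_id` are all hypothetical.

```python
from typing import FrozenSet, List, Sequence

Partition = List[FrozenSet[str]]


def is_refinement(finer: Partition, coarser: Partition) -> bool:
    """Return True if every group in `finer` sits inside one group of `coarser`."""
    return all(any(group <= parent for parent in coarser) for group in finer)


def validate_schedule(tasks: Sequence[str], schedule: List[Partition]) -> None:
    """Check that each stage covers every task once and only ever splits groups."""
    for stage, partition in enumerate(schedule):
        covered = sorted(task for group in partition for task in group)
        if covered != sorted(tasks):
            raise ValueError(f"stage {stage} does not cover each task exactly once")
        if stage > 0 and not is_refinement(partition, schedule[stage - 1]):
            raise ValueError(f"stage {stage} merges tasks that were already split")


def adapter_id(task: str, stage: int, schedule: List[Partition]) -> str:
    """Name the adapter module that `task` uses at `stage`."""
    for group in schedule[stage]:
        if task in group:
            return f"stage{stage}:" + "+".join(sorted(group))
    raise KeyError(task)


# Hypothetical three-task schedule: fully shared, then grouped, then task-specific.
TASKS = ["T1", "T2", "T3"]
SCHEDULE: List[Partition] = [
    [frozenset({"T1", "T2", "T3"})],
    [frozenset({"T1", "T2"}), frozenset({"T3"})],
    [frozenset({"T1"}), frozenset({"T2"}), frozenset({"T3"})],
]
validate_schedule(TASKS, SCHEDULE)
```

Under this toy schedule the model holds 1 + 2 + 3 = 6 adapter modules, against 9 if every task had its own module at every stage.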

If this is right

Outperforms prior parameter-efficient multi-task methods on PASCAL and NYUD-v2 datasets.
Requires fewer trainable parameters than competing approaches.
Reduces task interference and negative transfer through similarity-based sharing of adapters.
Works effectively when applied to Swin and Pyramid Vision Transformers.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The gradient-based task similarity idea might transfer to other parameter-efficient methods such as LoRA or prefix tuning.
Early-layer sharing could help scale multi-task training to larger numbers of tasks without linear growth in parameters.
The progressive design might generalize to other backbone architectures beyond the vision transformers tested here.

Load-bearing premise

The assumption that gradient-based task similarity computation can reliably allocate similar tasks to shared adapter modules to reduce task interference and negative transfer.
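
To make the premise concrete, here is a minimal sketch of grouping tasks by gradient similarity. It assumes each task's gradient has been flattened into a plain list of floats and uses ordinary cosine similarity. The average-linkage merging rule in `group_tasks` is an assumption chosen for illustration; the page does not say how the paper turns similarities into groups.

```python
import math
from itertools import combinations
from typing import Dict, FrozenSet, List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two gradient vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def pairwise_similarity(grads: Dict[str, Sequence[float]]) -> Dict[FrozenSet[str], float]:
    """Similarity for every unordered pair of tasks."""
    return {
        frozenset((a, b)): cosine(grads[a], grads[b])
        for a, b in combinations(sorted(grads), 2)
    }


def group_tasks(grads: Dict[str, Sequence[float]], num_groups: int) -> List[FrozenSet[str]]:
    """Merge the most similar groups (average linkage) until `num_groups` remain.

    The merging rule is an illustrative assumption, not the paper's procedure.
    """
    if not 1 <= num_groups <= len(grads):
        raise ValueError("num_groups must be between 1 and the number of tasks")
    sim = pairwise_similarity(grads)
    groups = [frozenset([task]) for task in sorted(grads)]

    def linkage(g1: FrozenSet[str], g2: FrozenSet[str]) -> float:
        scores = [sim[frozenset((a, b))] for a in g1 for b in g2]
        return sum(scores) / len(scores)

    while len(groups) > num_groups:
        g1, g2 = max(combinations(groups, 2), key=lambda pair: linkage(*pair))
        groups = [g for g in groups if g not in (g1, g2)] + [g1 | g2]
    return groups
```

If two tasks have nearly parallel gradients and a third is close to orthogonal, asking for two groups pairs the first two and leaves the third alone.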

What would settle it

If replacing the gradient-based task allocation with random grouping leads to similar or better performance on the tested datasets, or if the proposed method fails to show gains over baselines while using fewer parameters.
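
A random-grouping control is only informative if it matches the capacity of the similarity-based grouping, so the sketch below keeps the group sizes fixed and shuffles only which tasks land in which group. The function name `random_grouping` and the seeding scheme are hypothetical, and task names are assumed to be unique.

```python
import random
from typing import FrozenSet, List, Sequence


def random_grouping(
    tasks: Sequence[str], group_sizes: Sequence[int], seed: int
) -> List[FrozenSet[str]]:
    """Assign tasks to groups at random while preserving the given group sizes."""
    if sum(group_sizes) != len(tasks):
        raise ValueError("group sizes must add up to the number of tasks")
    rng = random.Random(seed)  # local generator, so runs are reproducible
    shuffled = list(tasks)
    rng.shuffle(shuffled)
    groups, start = [], 0
    for size in group_sizes:
        groups.append(frozenset(shuffled[start:start + size]))
        start += size
    return groups
```

Repeating this over several seeds gives a distribution of control results to compare against the single similarity-based grouping.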

Figures

Figures reproduced from arXiv: 2509.19602 by Anshuka Rangi, Holakou Rahmanian, Neeraj Gangwar, Nickvash Kani, Rishabh Deshmukh, Yesh Dattatreya.

**Figure 2.** Comparison of our proposed approach, progressive task…

**Figure 3.** (a) TGLoRA layer with two task groups. T1, T2, and T3 represent three tasks. (b) TGLoRA is added to the attention and MLP layers of the Swin Transformer.

…where S and L(·) represent the similarity and loss functions. Following Achille et al. [2], we use the cosine similarity, indicated by S_cos, between the normalized gradients to compute the similarity as follows: S(g, g′) = S_cos(…)

**Figure 4.** (a) and (b) illustrate the task similarities and computed task groups for PASCAL, respectively, while (c) and (d) present the same…
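
Figure 3 is available here only through its caption, so the following is a guess at the general shape of a LoRA-style layer with one low-rank update per task group. It is not TGLoRA as the authors define it. The class name `GroupedLoRALinear`, the factor shapes, and the routing by task name are all assumptions.

```python
from typing import Dict, FrozenSet, List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def matvec(matrix: Matrix, vector: Vector) -> Vector:
    """Multiply a row-major matrix by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


class GroupedLoRALinear:
    """Frozen linear map plus one low-rank update per task group (illustrative)."""

    def __init__(
        self,
        weight: Matrix,
        group_factors: Dict[FrozenSet[str], Tuple[Matrix, Matrix]],
    ) -> None:
        self.weight = weight                # frozen, shape (out, in)
        self.group_factors = group_factors  # group -> (A: (r, in), B: (out, r))

    def forward(self, x: Vector, task: str) -> Vector:
        """Return W x + B_g (A_g x), where g is the group containing `task`."""
        base = matvec(self.weight, x)
        for group, (down, up) in self.group_factors.items():
            if task in group:
                delta = matvec(up, matvec(down, x))
                return [b + d for b, d in zip(base, delta)]
        raise KeyError(task)
```

Tasks in the same group share one pair of factors and therefore receive the same update, while tasks in different groups are routed to different factors.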

read the original abstract

Parameter-efficient fine-tuning methods have emerged as a promising solution for adapting pre-trained models to various downstream tasks. While these methods perform well in single-task learning, extending them to multi-task learning exacerbates common issues, such as task interference and negative transfer, due to the limited number of trainable parameters. To address these challenges, we introduce progressive task-specific multi-task adaptation, a novel parameter-efficient approach for multi-task learning. Our approach introduces adapter modules that are shared in early layers and become increasingly task-specific in later layers. Additionally, we propose a gradient-based approach for computing task similarity and use this measure to allocate similar tasks to the shared adapter modules. To evaluate our approach, we adapt Swin and Pyramid Vision Transformers on PASCAL and NYUD-v2. On both datasets, our approach outperforms prior parameter-efficient multi-task methods while using fewer trainable parameters.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

The paper proposes progressive layer-wise adapters plus gradient-based task similarity for multi-task PEFT in vision transformers, but the abstract supplies no numbers or ablations to show the gains are real.

The main thing to know is that this work combines a progressive adapter design (shared modules in early layers that grow more task-specific later) with a gradient-based way to measure task similarity and decide which tasks share adapters. That setup aims to cut negative transfer in multi-task parameter-efficient fine-tuning without blowing up the trainable parameter count. They test it by adapting Swin and Pyramid Vision Transformers on PASCAL and NYUD-v2 and claim it beats prior PEFT multi-task baselines while using fewer parameters.

If the full experiments back this up, the idea is a straightforward extension of existing adapter work that directly targets interference, which is a practical pain point when you have limited capacity across related vision tasks. The gradient similarity measure is a reasonable choice because it can pick up on training dynamics rather than relying on hand-crafted affinities. The paper does a clean job of motivating the problem and laying out the design choices without overclaiming novelty beyond the specific combination.

The soft spots are mostly around evidence. The abstract gives no quantitative results, no baselines listed, no error bars, and no implementation details, so it is impossible to tell how large the gains are or whether they survive stronger controls. The stress-test point about noisy gradients or dominant tasks skewing the similarity allocation is worth checking; without ablations that isolate the allocation step from the simple fact of adding more task-specific capacity in later layers, it is easy to imagine the benefit coming from the progressive structure alone. Evaluation is also narrow (only two datasets and two transformer backbones), so robustness to different initializations or task orders remains open.

This is the kind of paper that would interest people working on efficient multi-task adaptation for vision transformers who already know the adapter literature. A reader who wants a modest but concrete design tweak could get something useful out of the full version if the numbers and ablations are solid. It deserves a serious referee because the core idea is motivated and the method is easy to reproduce, even though the current presentation is thin on data. I would send it to review rather than desk reject.

Referee Report

3 major / 2 minor

Summary. The paper introduces progressive task-specific multi-task adaptation for parameter-efficient fine-tuning of vision transformers. Adapter modules are shared across tasks in early layers and become progressively more task-specific in deeper layers; a gradient-based task similarity metric is used to allocate similar tasks to the same shared adapters. The method is evaluated by adapting Swin and Pyramid Vision Transformers to PASCAL and NYUD-v2, with the central claim that it outperforms prior parameter-efficient multi-task baselines while using fewer trainable parameters.

Significance. If the empirical claims hold after proper validation, the work would offer a practical advance in parameter-efficient multi-task learning by explicitly targeting task interference through a progressive sharing schedule and similarity-driven allocation. The design is novel relative to standard adapter or LoRA baselines and could influence efficient multi-task adaptation pipelines in computer vision.

major comments (3)

[§3.3] Gradient-based Task Similarity: The central claim that gradient-based similarity reliably groups tasks to reduce negative transfer lacks isolating ablations. The reported gains could arise from the increased task-specific capacity in later layers rather than the similarity mechanism; without an ablation that disables the similarity allocation while keeping the progressive structure, the load-bearing assumption remains untested.
[§4.1, Table 2] The outperformance statement on PASCAL and NYUD-v2 is presented without error bars, multiple random seeds, or statistical significance tests. Given that the abstract already omits all quantitative numbers, the tables must demonstrate that the reported margins are robust and not sensitive to initialization or training order.
[§3.2] Progressive Adapter Allocation: The description of how task similarity is computed from gradients during adaptation does not address potential dominance by high-magnitude tasks or sensitivity to the order in which tasks are presented; a concrete test (e.g., permuting task order or using different initializations) is needed to support the reliability claim.

minor comments (2)

[§3] The notation for the task similarity score (presumably Eq. (3) or (4)) should be defined before its first use in the method section to avoid forward references.
[Figure 2] The architecture diagram would benefit from explicit labels indicating which layers are shared versus task-specific and how the gradient similarity is injected.
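
The robustness check requested for Table 2 amounts to summarizing each method over seeds and testing the per-seed differences. The sketch below computes a mean, a sample standard deviation, and a paired t statistic with the standard library only; it stops short of a p-value, which needs a t-distribution table or a statistics package. The function names are hypothetical and no numbers from the paper are used.

```python
import math
import statistics
from typing import Sequence, Tuple


def mean_std(scores: Sequence[float]) -> Tuple[float, float]:
    """Mean and sample standard deviation of one method's scores across seeds."""
    return statistics.mean(scores), statistics.stdev(scores)


def paired_t_statistic(
    ours: Sequence[float], baseline: Sequence[float]
) -> Tuple[float, int]:
    """Paired t statistic and degrees of freedom for seed-matched scores."""
    if len(ours) != len(baseline) or len(ours) < 2:
        raise ValueError("need at least two seed-matched score pairs")
    diffs = [a - b for a, b in zip(ours, baseline)]
    spread = statistics.stdev(diffs)
    if spread == 0.0:
        raise ValueError("differences are constant; t statistic is undefined")
    t_value = statistics.mean(diffs) / (spread / math.sqrt(len(diffs)))
    return t_value, len(diffs) - 1
```

With three seeds the test has two degrees of freedom, so the two-sided 5% critical value is about 4.30; a margin has to be large relative to its seed-to-seed spread to clear it.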

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback and detailed comments on our manuscript. We address each major point below and describe the revisions we will make to strengthen the empirical validation and clarity of our method.

Referee: [§3.3] Gradient-based Task Similarity: The central claim that gradient-based similarity reliably groups tasks to reduce negative transfer lacks isolating ablations. The reported gains could arise from the increased task-specific capacity in later layers rather than the similarity mechanism; without an ablation that disables the similarity allocation while keeping the progressive structure, the load-bearing assumption remains untested.

Authors: We agree that an isolating ablation is necessary to separate the contribution of the similarity-based allocation from the progressive capacity increase. In the revised manuscript we will add a controlled ablation that retains the progressive sharing schedule but replaces the gradient-based allocation with random assignment of tasks to shared adapters. We will report the resulting performance drop on both PASCAL and NYUD-v2 to quantify the benefit attributable to the similarity mechanism. revision: yes

Referee: [§4.1, Table 2] The outperformance statement on PASCAL and NYUD-v2 is presented without error bars, multiple random seeds, or statistical significance tests. Given that the abstract already omits all quantitative numbers, the tables must demonstrate that the reported margins are robust and not sensitive to initialization or training order.

Authors: We acknowledge the importance of statistical robustness. In the revision we will rerun all main experiments with three independent random seeds, report mean and standard deviation in Table 2, and include paired t-tests or Wilcoxon tests to establish statistical significance of the observed improvements over the strongest baselines. revision: yes

Referee: [§3.2] Progressive Adapter Allocation: The description of how task similarity is computed from gradients during adaptation does not address potential dominance by high-magnitude tasks or sensitivity to the order in which tasks are presented; a concrete test (e.g., permuting task order or using different initializations) is needed to support the reliability claim.

Authors: We will expand §3.2 to clarify that gradient similarities are computed after per-task gradient normalization (dividing by the L2 norm of each task’s gradient) to reduce dominance by high-magnitude tasks. To demonstrate stability, we will add a new experiment that permutes task presentation order and repeats the similarity computation under two different adapter initializations, reporting both the resulting task groupings and downstream multi-task performance. revision: yes
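
The normalization described in this response can be checked on toy vectors: once each task's gradient is divided by its L2 norm, rescaling one task leaves the similarity unchanged, whereas a raw dot product grows with the scale. The sketch below assumes flattened gradients stored as lists of floats; `l2_normalize`, `raw_dot`, and `similarity_table` are hypothetical helper names.

```python
import math
from itertools import combinations
from typing import Dict, FrozenSet, List, Sequence


def l2_normalize(grad: Sequence[float]) -> List[float]:
    """Divide a gradient by its L2 norm (an all-zero gradient stays all zeros)."""
    norm = math.sqrt(sum(x * x for x in grad))
    if norm == 0.0:
        return [0.0 for _ in grad]
    return [x / norm for x in grad]


def raw_dot(g1: Sequence[float], g2: Sequence[float]) -> float:
    """Unnormalized dot product, which scales with gradient magnitude."""
    return sum(a * b for a, b in zip(g1, g2))


def similarity_table(grads: Dict[str, Sequence[float]]) -> Dict[FrozenSet[str], float]:
    """Dot products of L2-normalized gradients, keyed by unordered task pair."""
    unit = {task: l2_normalize(grad) for task, grad in grads.items()}
    return {
        frozenset((a, b)): raw_dot(unit[a], unit[b])
        for a, b in combinations(sorted(unit), 2)
    }
```

Because the table is keyed by unordered pairs, it also does not depend on the order in which tasks are listed. That only shows the computation is order-free for fixed gradients; whether the gradients themselves shift with presentation order during training is the empirical question the promised experiment would answer.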

Circularity Check

0 steps flagged

No circularity: novel progressive adaptation and gradient-based allocation evaluated empirically

full rationale

The paper proposes a new parameter-efficient multi-task method using progressively task-specific adapters (shared early, specific later) plus a gradient-based task similarity measure to group tasks. These design choices are presented as original contributions and validated through experiments on PASCAL and NYUD-v2 with Swin/PVT backbones, claiming fewer parameters and better performance than prior methods. No equations, self-citations, or fitted inputs are shown reducing the central claims to definitions or prior results by construction. The derivation chain consists of architectural decisions and empirical testing rather than tautological renaming or self-referential fitting.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

Based on abstract only; the approach rests on the domain assumption that early network layers capture shareable general features while later layers require task specificity, with no free parameters or invented entities explicitly quantified.

axioms (1)

domain assumption · Task interference and negative transfer are exacerbated in multi-task learning due to the limited number of trainable parameters.
Directly stated as the core challenge motivating the progressive adaptation approach.

invented entities (1)

progressive task-specific adapter modules · no independent evidence
purpose: Enable sharing in early layers and increasing task specificity in later layers for multi-task adaptation.
Newly introduced mechanism without mention of prior independent evidence or falsifiable predictions outside the paper.

pith-pipeline@v0.9.0 · 5700 in / 1247 out tokens · 46347 ms · 2026-05-18T13:37:31.662465+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/BranchSelection.lean · branch_selection · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

"adapter modules are shared among all tasks in the early layers, and they become increasingly specific to a subset of tasks as we move toward task-specific decoders... gradient-based approach for computing task similarity"

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

"We use the notion of gradient conflicts from the MTL literature to compute the similarity between a pair of tasks... S(g,g') = S_cos(...)"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Parameter-efficient Quantum Multi-task Learning
cs.LG · 2026-04 · unverdicted · novelty 6.0

QMTL uses shared VQC encoding plus task-specific quantum ansatz heads to achieve linear parameter scaling with the number of tasks while matching or exceeding classical multi-task baselines on three benchmarks.

Reference graph

Works this paper leans on

53 extracted references · 53 canonical work pages · cited by 1 Pith paper · 4 internal anchors

[1] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774.
[2] Alessandro Achille, Michael Lam, Rahul Tewari, Avinash Ravichandran, Subhransu Maji, Charless C Fowlkes, Stefano Soatto, and Pietro Perona. Task2vec: Task embedding for meta-learning. In Proceedings of the IEEE/CVF international conference on computer vision, pages 6430–6439.
[3] Ahmed Agiza, Marina Neseem, and Sherief Reda. MTLoRA: Low-rank adaptation approach for efficient multi-task learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16196–16205, 2024.
[4] Yutong Bai, Xinyang Geng, Karttikeya Mangalam, Amir Bar, Alan L Yuille, Trevor Darrell, Jitendra Malik, and Alexei A Efros. Sequential modeling enables scalable learning for large vision models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22861–22872, 2024.
[5] Hao Ban and Kaiyi Ji. Fair resource allocation in multi-task learning. In Forty-first International Conference on Machine Learning, 2024.
[6] David Brüggemann, Menelaos Kanakis, Stamatios Georgoulis, and Luc Van Gool. Automated search for resource-efficient branched multi-task networks. In 31st British Machine Vision Conference 2020, BMVC 2020, page 359. BMVA Press, 2020.
[7] Shoufa Chen, Chongjian Ge, Zhan Tong, Jiangliu Wang, Yibing Song, Jue Wang, and Ping Luo. Adaptformer: Adapting vision transformers for scalable visual recognition. Advances in Neural Information Processing Systems, 35:16664–16678, 2022.
[8] Xianjie Chen, Roozbeh Mottaghi, Xiaobai Liu, Sanja Fidler, Raquel Urtasun, and Alan Yuille. Detect what you can: Detecting and representing objects using holistic models and body parts. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1971–1978.
[9] Zhao Chen, Vijay Badrinarayanan, Chen-Yu Lee, and Andrew Rabinovich. Gradnorm: Gradient normalization for adaptive loss balancing in deep multitask networks. In International conference on machine learning, pages 794–803. PMLR, 2018.
[10] Zhe Chen, Yuchen Duan, Wenhai Wang, Junjun He, Tong Lu, Jifeng Dai, and Yu Qiao. Vision transformer adapter for dense predictions. In The Eleventh International Conference on Learning Representations, 2023.
[11] Michael Crawshaw. Multi-task learning with deep neural networks: A survey. arXiv preprint arXiv:2009.09796, 2020.
[12] Mark Everingham, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew Zisserman. The pascal visual object classes (voc) challenge. International Journal of Computer Vision, 88:303–338, 2010.
[13] Chris Fifty, Ehsan Amid, Zhe Zhao, Tianhe Yu, Rohan Anil, and Chelsea Finn. Efficiently identifying task groupings for multi-task learning. Advances in Neural Information Processing Systems, 34:27503–27516, 2021.
[14] Peng Gao, Jiaming Han, Renrui Zhang, Ziyi Lin, Shijie Geng, Aojun Zhou, Wei Zhang, Pan Lu, Conghui He, Xiangyu Yue, et al. Llama-adapter v2: Parameter-efficient visual instruction model. arXiv preprint arXiv:2304.15010.
[15] Pengsheng Guo, Chen-Yu Lee, and Daniel Ulbricht. Learning to branch for multi-task learning. In International conference on machine learning, pages 3854–3863. PMLR, 2020.
[16] Soufiane Hayou, Nikhil Ghosh, and Bin Yu. Lora+: Efficient low rank adaptation of large models. In Forty-first International Conference on Machine Learning, 2024.
[17] Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. Parameter-efficient transfer learning for nlp. In International conference on machine learning, pages 2790–2799. PMLR, 2019.
[18] Edward J Hu, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations, 2022.
[19] Zhiqiang Hu, Lei Wang, Yihuai Lan, Wanyu Xu, Ee-Peng Lim, Lidong Bing, Xing Xu, Soujanya Poria, and Roy Lee. Llm-adapters: An adapter family for parameter-efficient fine-tuning of large language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 5254–5276, 2023.
[20] Menglin Jia, Luming Tang, Bor-Chun Chen, Claire Cardie, Serge Belongie, Bharath Hariharan, and Ser-Nam Lim. Visual prompt tuning. In European Conference on Computer Vision, pages 709–727. Springer, 2022.
[21] Alex Kendall, Yarin Gal, and Roberto Cipolla. Multi-task learning using uncertainty to weigh losses for scene geometry and semantics. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7482–7491.
[22] Brian Lester, Rami Al-Rfou, and Noah Constant. The power of scale for parameter-efficient prompt tuning. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 3045–3059, 2021.
[23] Xiang Lisa Li and Percy Liang. Prefix-tuning: Optimizing continuous prompts for generation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 4582–4597, 2021.
[24] Yang Lin, Xinyu Ma, Xu Chu, Yujie Jin, Zhibang Yang, Yasha Wang, and Hong Mei. Lora dropout as a sparsity regularizer for overfitting control. arXiv preprint arXiv:2404.09610, 2024.
[25] Shih-yang Liu, Chien-Yi Wang, Hongxu Yin, Pavlo Molchanov, Yu-Chiang Frank Wang, Kwang-Ting Cheng, and Min-Hung Chen. Dora: Weight-decomposed low-rank adaptation. In Forty-first International Conference on Machine Learning, 2024.
[26] Yen-Cheng Liu, Chih-Yao Ma, Junjiao Tian, Zijian He, and Zsolt Kira. Polyhistor: Parameter-efficient multi-task adaptation for dense vision tasks. Advances in Neural Information Processing Systems, 35:36889–36901, 2022.
[27] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10012–10022, 2021.
[28] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations, 2017.
[29] Yongxi Lu, Abhishek Kumar, Shuangfei Zhai, Yu Cheng, Tara Javidi, and Rogério Schmidt Feris. Fully-adaptive feature sharing in multi-task networks with applications in person attribute classification. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1131–1140, 2016.
[30] Rabeeh Karimi Mahabadi, Sebastian Ruder, Mostafa Dehghani, and James Henderson. Parameter-efficient multi-task fine-tuning for transformers via shared hypernetworks. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)...
[31] Kevis-Kokitsi Maninis, Ilija Radosavovic, and Iasonas Kokkinos. Attentive single-tasking of multiple tasks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 1851–1860, 2019.
[32] Jonas Pfeiffer, Andreas Rücklé, Clifton Poth, Aishwarya Kamath, Ivan Vulic, Sebastian Ruder, Kyunghyun Cho, and Iryna Gurevych. Adapterhub: A framework for adapting transformers. EMNLP 2020, page 46, 2020.
[33] Jonas Pfeiffer, Aishwarya Kamath, Andreas Rücklé, Kyunghyun Cho, and Iryna Gurevych. Adapterfusion: Non-destructive task composition for transfer learning. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 487–503, 2021.
[34] Dmitry Senushkin, Nikolay Patakin, Arseny Kuznetsov, and Anton Konushin. Independent component alignment for multi-task learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20083–20093, 2023.
[35] Nathan Silberman, Derek Hoiem, Pushmeet Kohli, and Rob Fergus. Indoor segmentation and support inference from rgbd images. In Computer Vision–ECCV 2012: 12th European Conference on Computer Vision, Florence, Italy, October 7-13, 2012, Proceedings, Part V 12, pages 746–760. Springer, 2012.
[36] Yi-Lin Sung, Varun Nair, and Colin A Raffel. Training neural networks with fixed sparse masks. Advances in Neural Information Processing Systems, 34:24193–24205, 2021.
[37]

Gemini: A Family of Highly Capable Multimodal Models

Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean- Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: a family of highly capable multimodal models.arXiv preprint arXiv:2312.11805, 2023. 1

work page internal anchor Pith review Pith/arXiv arXiv 2023
[38]

Llama 2: Open Foundation and Fine-Tuned Chat Models

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models.arXiv preprint arXiv:2307.09288, 2023. 1

work page internal anchor Pith review Pith/arXiv arXiv 2023
[39]

Branched multi-task networks: Deciding what layers to share.Proceedings BMVC 2020,

Simon Vandenhende, Stamatios Georgoulis, Bert De Bra- bandere, and Luc Van Gool. Branched multi-task networks: Deciding what layers to share.Proceedings BMVC 2020,

work page 2020
[40]

Mti-net: Multi-scale task interaction networks for multi-task learning

Simon Vandenhende, Stamatios Georgoulis, and Luc Van Gool. Mti-net: Multi-scale task interaction networks for multi-task learning. InComputer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part IV 16, pages 527–543. Springer, 2020. 5

work page 2020
[41]

Multi-task learning for dense prediction tasks: A survey.IEEE transactions on pattern analysis and machine intelligence, 44(7):3614–3633, 2021

Simon Vandenhende, Stamatios Georgoulis, Wouter Van Gansbeke, Marc Proesmans, Dengxin Dai, and Luc Van Gool. Multi-task learning for dense prediction tasks: A survey.IEEE transactions on pattern analysis and machine intelligence, 44(7):3614–3633, 2021. 3, 5

work page 2021
[42]

Deep high-resolution represen- tation learning for visual recognition.IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(10):3349– 3364, 2020

Jingdong Wang, Ke Sun, Tianheng Cheng, Borui Jiang, Chaorui Deng, Yang Zhao, Dong Liu, Yadong Mu, Mingkui Tan, Xinggang Wang, et al. Deep high-resolution represen- tation learning for visual recognition.IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(10):3349– 3364, 2020. 6

[43] Yaqing Wang, Sahaj Agarwal, Subhabrata Mukherjee, Xiaodong Liu, Jing Gao, Ahmed Hassan, and Jianfeng Gao. AdaMix: Mixture-of-adaptations for parameter-efficient model tuning. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 5744–5760, 2022.

[44] Yi Xin, Siqi Luo, Haodi Zhou, Junlong Du, Xiaohong Liu, Yue Fan, Qing Li, and Yuntao Du. Parameter-efficient fine-tuning for pre-trained vision models: A survey. arXiv preprint arXiv:2402.02242, 2024.

[45] Dan Xu, Wanli Ouyang, Xiaogang Wang, and Nicu Sebe. PAD-Net: Multi-tasks guided prediction-and-distillation network for simultaneous depth estimation and scene parsing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 675–684, 2018.

[46] Hanrong Ye and Dan Xu. Inverted pyramid multi-task transformer for dense scene understanding. In European Conference on Computer Vision, pages 514–530. Springer, 2022.

[47] Jun Yu, Yutong Dai, Xiaokang Liu, Jin Huang, Yishan Shen, Ke Zhang, Rong Zhou, Eashan Adhikarla, Wenxuan Ye, Yixin Liu, et al. Unleashing the power of multi-task learning: A comprehensive survey spanning traditional, deep, and pretrained foundation model eras. arXiv preprint arXiv:2404.18961, 2024.

[48] Tianhe Yu, Saurabh Kumar, Abhishek Gupta, Sergey Levine, Karol Hausman, and Chelsea Finn. Gradient surgery for multi-task learning. Advances in Neural Information Processing Systems, 33:5824–5836, 2020.

[49] Elad Ben Zaken, Yoav Goldberg, and Shauli Ravfogel. BitFit: Simple parameter-efficient fine-tuning for transformer-based masked language-models. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 1–9, 2022.

[50] Qingru Zhang, Minshuo Chen, Alexander Bukharin, Pengcheng He, Yu Cheng, Weizhu Chen, and Tuo Zhao. Adaptive budget allocation for parameter-efficient fine-tuning. In The Eleventh International Conference on Learning Representations, 2023.

Parameter-Efficient Multi-Task Learning via Progressive Task-Specific Adaptation: Supplementary Material

Training Hyperparameters

PASCAL. We replicate the hyperparameters from Agiza et al. [3] for fine-tuning the models on the PASCAL dataset. Specifically, we use the AdamW optimizer [28] with a batch size of 32, a learning rate of 3.125×10⁻⁵, and a weight decay of 0.05. The models are fine-tuned for 300 epochs, with evaluations every 20 epochs. We use a li...
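The settings quoted above can be collected in a small configuration sketch. The class name `FineTuneConfig`, its field names, and the assumption that the first evaluation happens after epoch 20 (rather than before training) are illustrative choices, not taken from the paper or from the code of Agiza et al. [3]. The learning-rate schedule is left out because the source text is cut off at that point.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class FineTuneConfig:
    """Hypothetical container for the PASCAL settings quoted in the text."""
    optimizer: str = "AdamW"
    batch_size: int = 32
    learning_rate: float = 3.125e-5
    weight_decay: float = 0.05
    epochs: int = 300
    eval_every: int = 20

    def eval_epochs(self) -> List[int]:
        # Assumption: epochs are counted from 1 and evaluation runs after
        # every `eval_every`-th epoch, so the last one falls on the final epoch.
        return list(range(self.eval_every, self.epochs + 1, self.eval_every))


cfg = FineTuneConfig()
print(cfg.eval_epochs()[:3], len(cfg.eval_epochs()))
```

Under these assumptions the run would be evaluated 15 times, at epochs 20, 40, and so on up to 300.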

Additional Results

In this section, we present additional experiments on the PASCAL and NYUD-v2 datasets. Continuing from Section 4.5, Table 5 illustrates the performance of TGLoRA for varying trainable parameters. The rank of the low-rank modules in TGLoRA layers controls this number. The table also shows the performance of “Single Task – LoRA” and “...
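To make the link between rank and trainable-parameter count concrete, the sketch below counts the parameters of a standard LoRA update [18], which adds two factors of shapes d_out × r and r × d_in to a frozen d_out × d_in weight. Whether TGLoRA layers use exactly this parameterization is an assumption here, and the function name and layer sizes are purely illustrative.

```python
def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters added by one standard LoRA update W + B @ A.

    A has shape (rank, d_in) and B has shape (d_out, rank), so the count
    grows linearly with the rank. Assumption: no extra bias or scale terms.
    """
    return rank * (d_in + d_out)


# Hypothetical 768 x 768 projection evaluated at a few ranks.
for r in (4, 8, 16):
    print(r, lora_trainable_params(768, 768, r))
```

Because the count is linear in the rank, doubling the rank doubles the trainable parameters of each adapted layer in this simplified model.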

Computational Budget vs Performance

The tree structure offers a trade-off between model performance and inference cost. For instance, using 6.89M parameters for PASCAL-Context, and assigning one, one, two, and four task groups to the first through fourth stages, respectively, results in ∆m = +3.93% with 38.37 GMacs. Similarly, configuring the stages with o...
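A minimal sketch of how such a per-stage configuration might be represented: each stage holds a partition of the tasks into groups, and the tree property requires every group at a later stage to sit inside a single group of the previous stage. The task names and the particular two-group split are hypothetical placeholders; only the group counts (one, one, two, four) come from the text.

```python
from typing import List, Set

Stage = List[Set[str]]  # one stage = a partition of the tasks into groups


def is_tree(stages: List[Stage]) -> bool:
    """Return True if each stage's groups refine the previous stage's groups."""
    for prev, cur in zip(stages, stages[1:]):
        for group in cur:
            # Every group must be contained in one parent group.
            if not any(group <= parent for parent in prev):
                return False
    return True


# Hypothetical task names; the source does not list them at this point.
tasks = {"task_a", "task_b", "task_c", "task_d"}
stages: List[Stage] = [
    [set(tasks)],                                   # stage 1: one group
    [set(tasks)],                                   # stage 2: one group
    [{"task_a", "task_b"}, {"task_c", "task_d"}],   # stage 3: assumed split
    [{t} for t in sorted(tasks)],                   # stage 4: one group per task
]

print([len(s) for s in stages], is_tree(stages))
```

In a layout like this, a stage with more groups presumably runs more adapter branches, which is one way to read the inference-cost side of the trade-off described above.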

