
arxiv: 2511.18787 · v2 · submitted 2025-11-24 · 💻 cs.CV · cs.LG

Understanding Task Transfer in Vision-Language Models

Pith reviewed 2026-05-17 06:49 UTC · model grok-4.3

classification 💻 cs.CV cs.LG
keywords task transfer · vision-language models · perception tasks · finetuning effects · transfer graph · Perfection Gap Factor · zero-shot performance · data selection

The pith

Finetuning a vision-language model on one perception task produces measurable positive or negative effects on others that can be mapped into a transfer graph.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Vision-language models excel at broad multimodal tasks yet fall short on specific visual perception skills such as depth estimation or object counting. Finetuning on any single task can improve or degrade zero-shot results on the remaining tasks in ways that are hard to anticipate. The authors introduce the Perfection Gap Factor, a normalized score that quantifies how much performance shifts after finetuning, and combine it with a breadth-and-magnitude measure called Task Transferability. They run the process on three open-weight models across thirteen perception tasks and assemble the resulting relationships into a task transfer graph. The graph exposes clusters of mutually reinforcing tasks, identifies tasks that cause interference, groups tasks by characteristic transfer behavior, and shows how the metric can steer data selection toward more efficient training.
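
To make the two metrics concrete, here is a minimal Python sketch. This page does not reproduce the paper's formulas, so both definitions below are assumptions chosen for illustration: `pgf` scales a gain by the remaining headroom to a perfect score and a loss by the baseline score, and `positive_transferability` multiplies breadth (share of targets helped) by magnitude (mean PGF over those targets). The function names, the `threshold` parameter, the task names, and the example scores are all hypothetical.

```python
from statistics import mean


def pgf(before: float, after: float, perfect: float = 1.0) -> float:
    """Assumed normalization, not the paper's formula.

    Gains are expressed as a fraction of the gap to a perfect score,
    losses as a fraction of the baseline score, so values lie in [-1, 1].
    """
    delta = after - before
    if delta >= 0:
        headroom = perfect - before
        return delta / headroom if headroom > 0 else 0.0
    return delta / before  # before > 0 whenever the score dropped


def positive_transferability(pgf_row: dict[str, float], threshold: float = 0.0) -> float:
    """Assumed aggregate: breadth (share of targets helped) times magnitude."""
    helped = [v for v in pgf_row.values() if v > threshold]
    if not helped:
        return 0.0
    breadth = len(helped) / len(pgf_row)
    return breadth * mean(helped)


# Made-up zero-shot accuracies before and after finetuning on one source task.
before = {"depth": 0.60, "counting": 0.50, "pose": 0.80}
after = {"depth": 0.70, "counting": 0.40, "pose": 0.80}
row = {task: pgf(before[task], after[task]) for task in before}
score = positive_transferability(row)
```

Under these assumed definitions, the same 0.10 accuracy change counts for more when the baseline is already close to the ceiling, which is the kind of behavior a "gap to perfection" normalization is meant to capture.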

Core claim

The authors normalize post-finetuning performance changes with the Perfection Gap Factor and compute Task Transferability to capture both the scope and the size of those changes. From three VLMs evaluated on thirteen perception tasks, they build a task transfer graph. The graph reveals previously unobserved patterns of positive and negative transfer, identifies groups of tasks that mutually influence each other, and organizes tasks into personas according to their transfer signatures. The authors also demonstrate that the same metric can guide data selection for more efficient model training.

What carries the argument

The task transfer graph, constructed from Perfection Gap Factor scores that measure normalized performance change after finetuning on a source task and Task Transferability scores that combine breadth and magnitude of those changes.
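
As an illustration of how such a graph might be assembled and searched for mutually reinforcing groups, the sketch below keeps an undirected edge between two tasks only when transfer is positive in both directions, then enumerates maximal cliques with the basic Bron-Kerbosch recursion. The mutual-positivity rule, the `threshold` parameter, the task names, and the toy PGF values are assumptions; the paper's own graph construction may differ.

```python
def positive_graph(pgf, threshold=0.0):
    """Undirected adjacency: a and b are linked if PGF exceeds threshold both ways."""
    tasks = list(pgf)
    adj = {t: set() for t in tasks}
    for a in tasks:
        for b in tasks:
            if (
                a != b
                and pgf[a].get(b, 0.0) > threshold
                and pgf[b].get(a, 0.0) > threshold
            ):
                adj[a].add(b)
    return adj


def maximal_cliques(adj):
    """Basic Bron-Kerbosch enumeration (no pivoting); fine for about a dozen nodes."""
    cliques = []

    def expand(current, candidates, excluded):
        if not candidates and not excluded:
            cliques.append(current)
            return
        for node in list(candidates):
            expand(current | {node}, candidates & adj[node], excluded & adj[node])
            candidates = candidates - {node}
            excluded = excluded | {node}

    expand(set(), set(adj), set())
    return cliques


# toy_pgf[source][target]: made-up values for illustration only.
toy_pgf = {
    "depth": {"pose": 0.2, "counting": -0.1, "matching": 0.1},
    "pose": {"depth": 0.3, "counting": 0.05, "matching": 0.2},
    "counting": {"depth": 0.1, "pose": -0.2, "matching": 0.0},
    "matching": {"depth": 0.15, "pose": 0.1, "counting": 0.3},
}
cliques = maximal_cliques(positive_graph(toy_pgf))
largest = max(cliques, key=len)
```

In this toy example the largest clique contains the three tasks that help each other in every direction, while the remaining task ends up isolated because at least one direction of each of its pairings is non-positive.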

If this is right

  • Selecting a source task with high positive transferability to a target task can improve zero-shot performance without additional labels for the target.
  • Avoiding source tasks that produce negative transfer can prevent unintended performance drops on related perception skills.
  • Tasks that form mutual-influence clusters can be trained together to exploit synergies rather than trained in isolation.
  • Persona groupings derived from transfer behavior allow models to be specialized for families of related perception tasks rather than for single tasks.
  • PGF-based data selection can reduce the total number of finetuning examples needed while preserving or increasing average performance across the task set.
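
A hedged sketch of how PGF scores could steer source-task selection: for a given target, rank candidate sources by their PGF on that target and drop any that are expected to hurt. The function name `rank_sources`, the `min_pgf` cutoff, the task names, and the numbers are illustrative assumptions, not the paper's selection procedure.

```python
def rank_sources(pgf, target, min_pgf=0.0, top_k=None):
    """Return (source, score) pairs whose PGF on `target` exceeds `min_pgf`, best first."""
    scored = [
        (source, row[target])
        for source, row in pgf.items()
        if source != target and target in row and row[target] > min_pgf
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored if top_k is None else scored[:top_k]


# example_pgf[source][target]: made-up values for illustration only.
example_pgf = {
    "depth": {"counting": -0.10, "pose": 0.20},
    "pose": {"counting": 0.05, "depth": 0.30},
    "matching": {"counting": 0.30, "depth": 0.15},
}
print(rank_sources(example_pgf, "counting"))  # [('matching', 0.3), ('pose', 0.05)]
```

Here the source with negative PGF on the target is filtered out before ranking, which mirrors the second bullet above: avoiding sources that produce negative transfer.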

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The transfer graph implies that many perception tasks share latent visual features inside current VLMs, so mapping more tasks onto the same graph could reveal larger organizing principles.
  • If the patterns persist at larger model scales, practitioners could maintain a shared transfer database to choose pre-training or instruction-tuning data without exhaustive search.
  • Extending the graph to include generative or reasoning tasks would test whether the same positive and negative interference rules apply outside pure perception.
  • The persona classification could be used to design curricula that progress from low-interference to high-synergy task sequences during continued training.

Load-bearing premise

The thirteen chosen perception tasks and the three open-weight models are representative enough that the observed transfer patterns will hold for other tasks and other models.

What would settle it

Running the identical finetuning and evaluation protocol on a new set of perception tasks or on additional VLMs and obtaining transfer relationships that bear no structural resemblance to the published graph.
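
One simple way to quantify "structural resemblance" between two transfer graphs is the share of common source-target pairs whose transfer has the same sign. This is an illustrative choice, not a measure proposed in the paper; the function names, the `eps` dead zone, and the two example runs are assumptions.

```python
def sign(value, eps=0.0):
    """Map a PGF value to +1, -1, or 0 (inside the dead zone)."""
    if value > eps:
        return 1
    if value < -eps:
        return -1
    return 0


def sign_agreement(pgf_a, pgf_b, eps=0.0):
    """Share of source-target pairs present in both matrices with matching sign."""
    shared = [
        (source, target)
        for source in pgf_a if source in pgf_b
        for target in pgf_a[source] if target in pgf_b[source]
    ]
    if not shared:
        raise ValueError("the two matrices share no source-target pairs")
    matches = sum(
        sign(pgf_a[s][t], eps) == sign(pgf_b[s][t], eps) for s, t in shared
    )
    return matches / len(shared)


# Made-up PGF values from two hypothetical experimental runs.
run_a = {"depth": {"pose": 0.2, "counting": -0.1}, "pose": {"depth": 0.3}}
run_b = {"depth": {"pose": 0.1, "counting": 0.2}, "pose": {"depth": 0.4}}
agreement = sign_agreement(run_a, run_b)  # 2 of 3 shared pairs agree in sign
```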

Figures

Figures reproduced from arXiv: 2511.18787 by Abhinav Java, Bhuvan Sachdeva, Karan Uppal, Vineeth N. Balasubramanian.

Figure 1: One finetune, many fates: Finetuning Qwen-2.5-VL 32B on perception tasks creates a structured map of transfer capabilities. (The ...
Figure 2: PGF Heatmaps for Qwen-2.5-VL model family (3B, 7B, 32B).
Figure 3: Average positive malleability trends across granular and ...
Figure 4: Average positive transferability trends across granular and ...
Figure 5: Task transferability trends across model sizes in Qwen ...
Figure 6: Positive clique of size 9 from Qwen-2.5-VL 32B.
Figure 8: Performance comparison under different dataset selection ...
Figure 10: A negative clique of size 4 from Qwen-2.5-VL 32B.

Original abstract

Vision-Language Models (VLMs) perform well on multimodal benchmarks but lag behind humans and specialized models on visual perception tasks like depth estimation or object counting. Finetuning on one task can unpredictably affect performance on others, making task-specific finetuning challenging. In this paper, we address this challenge through a systematic study of task transferability. We examine how finetuning a VLM on one perception task affects its zero-shot performance on others. We introduce Perfection Gap Factor (PGF), a normalized metric that measures change in performance as a result of task transfer. We utilize PGF to compute Task Transferability, which captures both the breadth and the magnitude of transfer induced by a source task. Using three open-weight VLMs evaluated across 13 perception tasks, we construct a task transfer graph that reveals previously unobserved relationships among perception tasks. Our analysis uncovers patterns of positive and negative transfer, identifies groups of tasks that mutually influence each other, organizes tasks into personas based on their transfer behavior and demonstrates how PGF can guide data selection for more efficient training. These findings highlight both opportunities for positive transfer and risks of negative interference, offering actionable guidance for advancing VLMs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The manuscript presents an empirical study of task transfer in vision-language models. It finetunes three open-weight VLMs on each of 13 perception tasks, measures zero-shot performance changes on the remaining tasks, and introduces the Perfection Gap Factor (PGF) as a normalized metric of performance delta. From these measurements the authors compute Task Transferability scores, construct a task transfer graph, and report patterns of positive/negative transfer, mutually influencing task clusters, task 'personas,' and potential uses of PGF for data selection.

Significance. If the observed transfer patterns prove robust beyond the chosen models and tasks, the work would supply a practical framework for understanding inter-task interference and synergies in VLM finetuning, with direct implications for more efficient training regimes. The introduction of PGF and the graph-based analysis constitute a concrete, falsifiable contribution to the study of multimodal transfer.

major comments (2)
  1. [§5] §5 (Experimental Setup and Results): The task transfer graph and all derived claims about previously unobserved relationships, clusters, and personas rest on evaluations using only three open-weight VLMs and a fixed collection of 13 perception tasks. No sensitivity analysis, ablation on model family, or resampling of tasks is reported; therefore it remains possible that the reported positive/negative transfers and mutual-influence structure are artifacts of this narrow experimental slice rather than stable properties of perception tasks in VLMs.
  2. [§3.2] §3.2 (Definition of PGF): The Perfection Gap Factor is introduced as a normalized metric of performance change, yet the manuscript provides neither the exact formula relating PGF to raw accuracy deltas nor any demonstration that the normalization is independent of the choice of zero-shot baseline. Because Task Transferability is computed directly from PGF values, any dependence on post-hoc baseline selection would propagate into the graph and undermine the central empirical claims.
minor comments (3)
  1. [Abstract] The abstract refers to 'task personas' without a one-sentence definition or illustrative example; a brief clarification would improve readability for readers unfamiliar with the concept.
  2. [Figures] Figure captions and axis labels for the task transfer graph should explicitly indicate how edge weights or colors encode positive versus negative transfer to make the reported patterns immediately verifiable from the figure.
  3. [§4] A short table listing the 13 perception tasks together with their source datasets and evaluation metrics would help readers assess the diversity of the task set.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment point by point below, indicating the revisions we plan to incorporate.

Point-by-point responses
  1. Referee: [§5] §5 (Experimental Setup and Results): The task transfer graph and all derived claims about previously unobserved relationships, clusters, and personas rest on evaluations using only three open-weight VLMs and a fixed collection of 13 perception tasks. No sensitivity analysis, ablation on model family, or resampling of tasks is reported; therefore it remains possible that the reported positive/negative transfers and mutual-influence structure are artifacts of this narrow experimental slice rather than stable properties of perception tasks in VLMs.

    Authors: We agree that the experimental scope is limited to three open-weight VLMs and 13 perception tasks, and that the absence of sensitivity analysis or task resampling leaves open the possibility that some observed patterns could be specific to this selection. Our choice of models and tasks was driven by the need for computational tractability while covering a broad range of visual perception capabilities. In the revised manuscript we will add an explicit limitations subsection that discusses the scope of the reported transfer graph and clusters, and we will include a small-scale sensitivity check (e.g., one additional model or a subset of tasks) where feasible. We view this as a partial revision that clarifies rather than fully eliminates the concern. revision: partial

  2. Referee: [§3.2] §3.2 (Definition of PGF): The Perfection Gap Factor is introduced as a normalized metric of performance change, yet the manuscript provides neither the exact formula relating PGF to raw accuracy deltas nor any demonstration that the normalization is independent of the choice of zero-shot baseline. Because Task Transferability is computed directly from PGF values, any dependence on post-hoc baseline selection would propagate into the graph and undermine the central empirical claims.

    Authors: We will revise Section 3.2 to state the precise mathematical definition of PGF, explicitly showing its relation to raw accuracy deltas and the zero-shot baseline. We will also add a short robustness analysis (main text or appendix) that recomputes Task Transferability under alternative baseline choices and confirms that the resulting graph structure and rankings remain stable. This constitutes a full revision of the presentation and supporting evidence for the metric. revision: yes
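
The robustness analysis promised in this response could take many forms. One hedged possibility is to score every source task under each baseline choice and compare the two resulting rankings with Spearman's rank correlation. The sketch below implements that correlation in plain Python with average ranks for ties; the score lists are made up, and nothing here is taken from the paper or the rebuttal.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman's rho: Pearson correlation computed on the rank vectors."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("rank correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


# Made-up transferability scores for four source tasks under two baselines.
scores_baseline_a = [0.30, 0.12, 0.05, 0.21]
scores_baseline_b = [0.28, 0.10, 0.07, 0.25]
rho = spearman(scores_baseline_a, scores_baseline_b)
```

A value of `rho` near 1 would indicate that the ordering of source tasks is insensitive to the baseline choice; it says nothing about whether the absolute PGF values agree.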

Circularity Check

0 steps flagged

No circularity: PGF and task transferability are direct empirical computations from measured performance deltas

Full rationale

The paper defines Perfection Gap Factor (PGF) explicitly as a normalized metric on observed performance changes after finetuning, then derives Task Transferability and the transfer graph by applying this metric to results from 13 tasks across 3 VLMs. This is a data-driven construction with no reduction of outputs to inputs by definition, no fitted parameters renamed as predictions, and no load-bearing self-citations or imported uniqueness theorems. The derivation chain remains self-contained against the experimental measurements; representativeness concerns affect generalizability but do not create circularity in the reported steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The central claim rests on the empirical definition of PGF as a normalized performance-change metric and on the assumption that the selected tasks and models capture general transfer behavior; no free parameters are explicitly fitted in the abstract description.

axioms (1)
  • [domain assumption] Zero-shot performance on a target task can be measured after fine-tuning on a source task without confounding effects from prompt or evaluation protocol changes.
    Invoked when the authors compare pre- and post-fine-tuning zero-shot scores across tasks.
invented entities (2)
  • Perfection Gap Factor (PGF) [no independent evidence]
    purpose: Normalized metric that quantifies the signed change in performance on a target task after fine-tuning on a source task.
    Newly defined quantity introduced to enable the transfer graph and Task Transferability score.
  • Task Transferability [no independent evidence]
    purpose: Scalar that aggregates breadth and magnitude of transfer induced by a source task.
    Derived quantity computed from PGF values across target tasks.


