LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model
Recognition: 2 theorem links
Pith reviewed 2026-05-15 08:36 UTC · model grok-4.3
The pith
LLaMA-Adapter V2 turns LLaMA into an open-ended visual instruction follower by adding only 14 million parameters.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
LLaMA-Adapter V2 shows that unlocking additional parameters across the base LLaMA model, feeding visual tokens into the early layers only, and jointly training on image-text pairs and instruction data with disjoint parameter sets together produce open-ended multi-modal instruction following with 14 million extra parameters over LLaMA.
What carries the argument
Early fusion of visual tokens into initial LLM layers together with disjoint-parameter joint training on image-text alignment and instruction data.
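To make the mechanism concrete, here is a minimal sketch of what early fusion could look like: visual features are projected into the LLM's hidden space and attended to only in the first few transformer blocks. The class and parameter names (`EarlyFusionLM`, `visual_proj`, `fusion_layers`) are illustrative assumptions, not the released LLaMA-Adapter V2 code.

```python
import torch
import torch.nn as nn

class EarlyFusionLM(nn.Module):
    """Sketch: visual tokens are visible only to the first `fusion_layers` blocks."""

    def __init__(self, blocks, visual_dim, hidden_dim, fusion_layers=8):
        super().__init__()
        self.blocks = nn.ModuleList(blocks)                    # frozen LLaMA-style blocks (hypothetical)
        self.visual_proj = nn.Linear(visual_dim, hidden_dim)   # small learnable projection
        self.fusion_layers = fusion_layers

    def forward(self, text_hidden, visual_feats):
        vis = self.visual_proj(visual_feats)                   # (batch, n_vis, hidden_dim)
        n_vis = vis.shape[1]
        h = text_hidden
        for i, block in enumerate(self.blocks):
            if i < self.fusion_layers:
                # prepend visual tokens, run the block, then keep only the text positions
                h = block(torch.cat([vis, h], dim=1))[:, n_vis:]
            else:
                h = block(h)                                   # later layers see text states only
        return h
```

Restricting fusion to the early layers is what the paper credits with better visual knowledge incorporation; the exact layer count and injection mechanism (additive visual prompts with zero-initialized gating in the original adapter) differ from this simplified concatenation.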
If this is right
- Open-ended multi-modal instructions become possible with far fewer added parameters than full fine-tuning.
- Language-only instruction following improves as a side effect of the same training.
- Expert vision modules can be swapped in at test time without retraining the core model.
- Small-scale image-text and instruction datasets suffice for competitive multi-modal reasoning.
Where Pith is reading between the lines
- The same unlocked-parameter and early-fusion pattern may transfer directly to other base language models beyond LLaMA.
- Scaling the base model size could reduce the relative parameter overhead even further while preserving the same training separation.
- The approach suggests a route to multi-modal capability that avoids the need for massive paired training corpora.
Load-bearing premise
Early fusion and disjoint parameter groups will continue to prevent interference between the alignment and instruction tasks even when the data distribution shifts or larger base models are used.
What would settle it
Performance collapse on a new open-ended visual instruction benchmark that shows clear task interference between image-text alignment and instruction following would disprove the central claim.
Original abstract
How to efficiently transform large language models (LLMs) into instruction followers is recently a popular research direction, while training LLM for multi-modal reasoning remains less explored. Although the recent LLaMA-Adapter demonstrates the potential to handle visual inputs with LLMs, it still cannot generalize well to open-ended visual instructions and lags behind GPT-4. In this paper, we present LLaMA-Adapter V2, a parameter-efficient visual instruction model. Specifically, we first augment LLaMA-Adapter by unlocking more learnable parameters (e.g., norm, bias and scale), which distribute the instruction-following ability across the entire LLaMA model besides adapters. Secondly, we propose an early fusion strategy to feed visual tokens only into the early LLM layers, contributing to better visual knowledge incorporation. Thirdly, a joint training paradigm of image-text pairs and instruction-following data is introduced by optimizing disjoint groups of learnable parameters. This strategy effectively alleviates the interference between the two tasks of image-text alignment and instruction following and achieves strong multi-modal reasoning with only a small-scale image-text and instruction dataset. During inference, we incorporate additional expert models (e.g. captioning/OCR systems) into LLaMA-Adapter to further enhance its image understanding capability without incurring training costs. Compared to the original LLaMA-Adapter, our LLaMA-Adapter V2 can perform open-ended multi-modal instructions by merely introducing 14M parameters over LLaMA. The newly designed framework also exhibits stronger language-only instruction-following capabilities and even excels in chat interactions. Our code and models are available at https://github.com/ZrrSkywalker/LLaMA-Adapter.
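The abstract's second and third ingredients, unlocking norm/bias/scale parameters and optimizing disjoint parameter groups on the two data sources, can be pictured with a short sketch. Everything below, including the substring matching on parameter names and the Hugging Face-style `model(**batch).loss` call, is an assumption for illustration rather than the authors' implementation.

```python
import torch

def split_param_groups(model):
    """Sketch of the disjoint split:
    group A (image-text pairs)  -> visual projection and gating parameters
    group B (instruction data)  -> unlocked norms, biases, and scales in the LLM."""
    group_a, group_b = [], []
    for name, p in model.named_parameters():
        p.requires_grad = False
        if "visual_proj" in name or "gate" in name:             # hypothetical module names
            p.requires_grad = True
            group_a.append(p)
        elif any(k in name for k in ("norm", "bias", "scale")):
            p.requires_grad = True
            group_b.append(p)
    return group_a, group_b

def joint_training_step(model, caption_batch, instruct_batch, opt_a, opt_b):
    """Alternate the two objectives; each optimizer only ever updates its own group."""
    model.zero_grad(set_to_none=True)
    model(**caption_batch).loss.backward()    # image-text alignment
    opt_a.step()

    model.zero_grad(set_to_none=True)
    model(**instruct_batch).loss.backward()   # language instruction following
    opt_b.step()
```

With `opt_a = torch.optim.AdamW(group_a)` and `opt_b = torch.optim.AdamW(group_b)`, a gradient from one objective never moves the other objective's parameters; that separation is the interference-avoidance argument in miniature.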
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces LLaMA-Adapter V2 as a parameter-efficient extension of the original LLaMA-Adapter for visual instruction following. It augments the base model by unlocking additional learnable parameters (norms, biases, scales) across LLaMA layers, introduces an early-fusion strategy that injects visual tokens only into early LLM layers, proposes joint training on image-text pairs and instruction data using disjoint groups of learnable parameters to reduce task interference, and incorporates external expert models (captioning/OCR) at inference time. The central claim is that these changes enable open-ended multi-modal instruction following with only 14M added parameters over LLaMA while also strengthening language-only capabilities.
Significance. If the empirical results hold, the work demonstrates a lightweight route to multi-modal instruction models that avoids full fine-tuning of large LLMs. The combination of parameter unlocking, early fusion, and disjoint-parameter joint training addresses a practical bottleneck in scaling visual instruction tuning, and the inference-time expert integration provides a zero-training-cost way to boost image understanding. These elements could influence subsequent adapter-based multi-modal systems if the interference-reduction mechanism is shown to generalize.
major comments (2)
- [Abstract and §3, joint training description] The claim that disjoint-parameter optimization 'effectively alleviates the interference' between image-text alignment and instruction following is load-bearing for the 14M-parameter performance claim, yet no ablation comparing shared versus disjoint optimization on the same data mixture is reported, nor any diagnostic such as gradient similarity, per-task loss curves, or a performance delta. Without this measurement the contribution of the disjoint split cannot be isolated from the effects of unlocked norms/biases or early fusion (a minimal diagnostic sketch follows these major comments).
- [§4, experimental results] The headline comparison to the original LLaMA-Adapter and to GPT-4-style models rests on quantitative tables that are referenced but not shown in the provided text; without error bars, statistical significance tests, or per-task breakdowns it is impossible to assess whether the reported gains are robust or driven by particular data subsets.
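One way to supply the diagnostic asked for in the first comment is to measure how aligned the two tasks' gradients are on a shared parameter subset. The sketch below assumes the same hypothetical Hugging Face-style `model(**batch).loss` interface as above; it is illustrative, not an existing LLaMA-Adapter utility.

```python
import torch
import torch.nn.functional as F

def task_gradient_cosine(model, caption_batch, instruct_batch, shared_params):
    """Cosine similarity between per-task gradients on `shared_params`.
    Values near or below zero would indicate the interference the disjoint split is meant to avoid."""
    def flat_grad(batch):
        model.zero_grad(set_to_none=True)
        model(**batch).loss.backward()
        return torch.cat([p.grad.reshape(-1) for p in shared_params if p.grad is not None])

    g_align = flat_grad(caption_batch)    # image-text alignment gradient
    g_instr = flat_grad(instruct_batch)   # instruction-following gradient
    return F.cosine_similarity(g_align, g_instr, dim=0).item()
```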
minor comments (2)
- [§2] The exact count and placement of the unlocked norm/bias/scale parameters should be stated explicitly (e.g., which layers and which modules) rather than left as 'e.g., norm, bias and scale'.
- [Figures] Figure captions and axis labels in the (referenced) ablation and comparison figures should include the precise training data sizes and hyper-parameter settings used for each variant to allow direct reproduction.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript to strengthen the presentation of our contributions.
Point-by-point responses
-
Referee: [Abstract and §3, joint training description] The claim that disjoint-parameter optimization 'effectively alleviates the interference' between image-text alignment and instruction following is load-bearing for the 14M-parameter performance claim, yet no ablation comparing shared versus disjoint optimization on the same data mixture is reported, nor any diagnostic such as gradient similarity, per-task loss curves, or a performance delta. Without this measurement the contribution of the disjoint split cannot be isolated from the effects of unlocked norms/biases or early fusion.
Authors: We agree that an explicit ablation isolating the disjoint-parameter strategy would strengthen the claim. Our joint training results demonstrate that optimizing disjoint parameter groups on image-text pairs and instruction data yields strong multi-modal performance with only 14M added parameters, outperforming the original LLaMA-Adapter. In the revision we will add an ablation comparing shared versus disjoint optimization on the same data mixture, including performance deltas and task-specific metrics to isolate the interference-reduction effect. revision: yes
-
Referee: [§4, experimental results] The headline comparison to the original LLaMA-Adapter and to GPT-4-style models rests on quantitative tables that are referenced but not shown in the provided text; without error bars, statistical significance tests, or per-task breakdowns it is impossible to assess whether the reported gains are robust or driven by particular data subsets.
Authors: The full manuscript contains Tables 1-4 with quantitative comparisons on VQA, captioning, and instruction-following benchmarks. We acknowledge that error bars, significance tests, and per-task breakdowns would improve assessment of robustness. In the revision we will add error bars for multi-seed experiments, report statistical significance where applicable, and include expanded per-task breakdowns in the main text or supplementary material. revision: partial
Circularity Check
No circularity: empirical architecture and training choices evaluated against external benchmarks
Full rationale
The paper describes three design choices (unlocking norm/bias/scale parameters, early visual fusion, and disjoint-parameter joint training on image-text plus instruction data) followed by empirical evaluation against the prior LLaMA-Adapter and external models. No equations, first-principles derivations, or predictions appear in the provided text. The 14M-parameter claim is a direct count of added learnable weights, not a fitted or self-referential quantity. Self-citation to the original LLaMA-Adapter is present but functions only as a baseline comparison, not as load-bearing justification for any uniqueness theorem or ansatz. The joint-training interference claim is asserted without an internal ablation, but this is an evidence gap rather than circularity; the reported results remain falsifiable against held-out benchmarks. No step reduces by construction to its own inputs.
Axiom & Free-Parameter Ledger
axioms (2)
- standard math: Transformer layers with standard attention and feed-forward blocks can be adapted via low-rank or partial-parameter updates while preserving core language capabilities (a counting sketch follows this ledger).
- domain assumption: Visual tokens can be treated as additional sequence elements that the language model can attend to without architectural redesign.
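As a rough check on the first axiom and the 14M figure, one can freeze a LLaMA-style model, re-enable only norm, bias, and scale tensors, and count what remains trainable. The sketch assumes Hugging Face-style `named_parameters()` and substring matching on parameter names; note that the released LLaMA weights carry norm weights but no bias or scale terms on the linear layers, so V2's added tensors would have to be inserted before this count approaches the paper's figure.

```python
def unlock_and_count(model):
    """Freeze everything, re-enable only norm/bias/scale tensors, and report the budget."""
    total, trainable = 0, 0
    for name, p in model.named_parameters():
        total += p.numel()
        p.requires_grad = any(key in name for key in ("norm", "bias", "scale"))
        if p.requires_grad:
            trainable += p.numel()
    print(f"trainable: {trainable / 1e6:.1f}M of {total / 1e6:.1f}M "
          f"({100 * trainable / total:.3f}%)")
    return trainable
```

On a 7B-parameter base, 14M trainable weights is roughly 0.2% of the model, which is the sense in which the method is parameter-efficient.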
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation : washburn_uniqueness_aczel
Tag: unclear. The relation between the paper passage and the cited Recognition theorem is ambiguous.
a joint training paradigm of image-text pairs and instruction-following data is introduced by optimizing disjoint groups of learnable parameters. This strategy effectively alleviates the interference between the two tasks
-
IndisputableMonolith/Foundation/RealityFromDistinction : reality_from_one_distinction
Tag: unclear. The relation between the paper passage and the cited Recognition theorem is ambiguous.
we propose an early fusion strategy to feed visual tokens only into the early LLM layers
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 20 Pith papers
-
MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark
MMMU-Pro is a stricter multimodal benchmark that removes text-only solvable questions, augments options, and requires reading text from images, yielding substantially lower model scores of 16.8-26.9%.
-
MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI
MMMU provides 11.5K heterogeneous college-level multimodal questions that current models solve at 56-59% accuracy, establishing a new standard for expert multimodal evaluation.
-
Q-Align: Teaching LMMs for Visual Scoring via Discrete Text-Defined Levels
Q-Align trains LMMs on discrete text-defined levels for visual scoring, achieving SOTA on IQA, IAA, and VQA while unifying the tasks in OneAlign.
-
SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension
SEED-Bench is a new benchmark of 19K multiple-choice questions for evaluating generative comprehension in multimodal LLMs across 12 image and video dimensions.
-
Evaluating Object Hallucination in Large Vision-Language Models
Large vision-language models exhibit severe object hallucination that varies with training instructions, and the proposed POPE polling method evaluates it more stably and flexibly than prior approaches.
-
LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention
LLaMA-Adapter turns frozen LLaMA 7B into a capable instruction follower using only 1.2M new parameters and zero-init attention, matching Alpaca while extending to image-conditioned reasoning on ScienceQA and COCO.
-
Latent Denoising Improves Visual Alignment in Large Multimodal Models
A latent denoising objective with saliency-aware corruption and contrastive distillation improves visual alignment and corruption robustness in large multimodal models.
-
Symbiotic-MoE: Unlocking the Synergy between Generation and Understanding
Symbiotic-MoE introduces modality-aware expert disentanglement and progressive training in a multimodal MoE to achieve synergistic generation and understanding without task interference or extra parameters.
-
Chat-Scene++: Exploiting Context-Rich Object Identification for 3D LLM
Chat-Scene++ improves 3D scene understanding in multimodal LLMs by representing scenes as context-rich object sequences with identifier tokens and grounded chain-of-thought reasoning, reaching state-of-the-art on five...
-
Video-LLaVA: Learning United Visual Representation by Alignment Before Projection
Video-LLaVA creates a unified visual representation for images and videos via pre-projection alignment, enabling mutual enhancement from joint training and strong results on image and video benchmarks.
-
IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models
IP-Adapter adds effective image prompting to text-to-image diffusion models using a lightweight decoupled cross-attention adapter that works alongside text prompts and other controls.
-
MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models
MME is a manually annotated benchmark evaluating MLLMs on perception and cognition across 14 subtasks to avoid data leakage and support fair model comparisons.
-
Otter: A Multi-Modal Model with In-Context Instruction Tuning
Otter is a multi-modal model instruction-tuned on the MIMIC-IT dataset of over 3 million in-context instruction-response pairs to improve convergence and generalization on tasks with multiple images and videos.
-
Adaptor: Advancing Assistive Teleoperation with Few-Shot Learning and Cross-Operator Generalization
Adaptor uses few-shot learning with trajectory perturbation and vision-language conditioning to achieve robust cross-operator intent recognition and higher success rates in assistive teleoperation.
-
Hallucination of Multimodal Large Language Models: A Survey
The survey organizes causes of hallucinations in MLLMs, reviews evaluation benchmarks and metrics, and outlines mitigation approaches plus open questions.
-
InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks
InternVL scales a vision model to 6B parameters and aligns it with LLMs using web data to achieve state-of-the-art results on 32 visual-linguistic benchmarks.
-
Empowering Video Translation using Multimodal Large Language Models
The paper offers the first focused review of MLLM-based video translation organized by a three-role taxonomy of Semantic Reasoner, Expressive Performer, and Visual Synthesizer, plus open challenges.
-
How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites
InternVL 1.5 narrows the performance gap to proprietary multimodal models via a stronger transferable vision encoder, dynamic high-resolution tiling, and curated English-Chinese training data.
-
Parameter-Efficient Fine-Tuning for Large Models: A Comprehensive Survey
A comprehensive survey of PEFT algorithms for large models, covering their performance, overhead, applications, and real-world system implementations.
-
A Survey on Hallucination in Large Vision-Language Models
This survey reviews the definition, symptoms, evaluation benchmarks, root causes, and mitigation methods for hallucinations in large vision-language models.