LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model
Recognition: 2 theorem links
Pith reviewed 2026-05-15 08:36 UTC · model grok-4.3
The pith
LLaMA-Adapter V2 turns LLaMA into an open-ended visual instruction follower by adding only 14 million parameters.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
LLaMA-Adapter V2 shows that unlocking additional parameters across the base LLaMA model, feeding visual tokens into the early layers only, and jointly training on image-text pairs and instruction data with disjoint parameter sets together produce open-ended multi-modal instruction following with 14 million extra parameters over LLaMA.
What carries the argument
Early fusion of visual tokens into initial LLM layers together with disjoint-parameter joint training on image-text alignment and instruction data.
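To make the mechanism concrete, here is a minimal sketch of what early fusion could look like: visual features are projected into the LLM's hidden space and attended to only in the first few transformer blocks. The class and parameter names (`EarlyFusionLM`, `visual_proj`, `fusion_layers`) are illustrative assumptions, not the released LLaMA-Adapter V2 code.

```python
import torch
import torch.nn as nn

class EarlyFusionLM(nn.Module):
    """Sketch: visual tokens are visible only to the first `fusion_layers` blocks."""

    def __init__(self, blocks, visual_dim, hidden_dim, fusion_layers=8):
        super().__init__()
        self.blocks = nn.ModuleList(blocks)                    # frozen LLaMA-style blocks (hypothetical)
        self.visual_proj = nn.Linear(visual_dim, hidden_dim)   # small learnable projection
        self.fusion_layers = fusion_layers

    def forward(self, text_hidden, visual_feats):
        vis = self.visual_proj(visual_feats)                   # (batch, n_vis, hidden_dim)
        n_vis = vis.shape[1]
        h = text_hidden
        for i, block in enumerate(self.blocks):
            if i < self.fusion_layers:
                # prepend visual tokens, run the block, then keep only the text positions
                h = block(torch.cat([vis, h], dim=1))[:, n_vis:]
            else:
                h = block(h)                                   # later layers see text states only
        return h
```

Restricting fusion to the early layers is what the paper credits with better visual knowledge incorporation; the exact layer count and injection mechanism (additive visual prompts with zero-initialized gating in the original adapter) differ from this simplified concatenation.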
If this is right
- Open-ended multi-modal instructions become possible with far fewer added parameters than full fine-tuning.
- Language-only instruction following improves as a side effect of the same training.
- Expert vision modules can be swapped in at test time without retraining the core model.
- Small-scale image-text and instruction datasets suffice for competitive multi-modal reasoning.
Where Pith is reading between the lines
- The same unlocked-parameter and early-fusion pattern may transfer directly to other base language models beyond LLaMA.
- Scaling the base model size could reduce the relative parameter overhead even further while preserving the same training separation.
- The approach suggests a route to multi-modal capability that avoids the need for massive paired training corpora.
Load-bearing premise
Early fusion and disjoint parameter groups will continue to prevent interference between the alignment and instruction tasks even when the data distribution shifts or larger base models are used.
What would settle it
Performance collapse on a new open-ended visual instruction benchmark that shows clear task interference between image-text alignment and instruction following would disprove the central claim.
Original abstract
How to efficiently transform large language models (LLMs) into instruction followers is recently a popular research direction, while training LLM for multi-modal reasoning remains less explored. Although the recent LLaMA-Adapter demonstrates the potential to handle visual inputs with LLMs, it still cannot generalize well to open-ended visual instructions and lags behind GPT-4. In this paper, we present LLaMA-Adapter V2, a parameter-efficient visual instruction model. Specifically, we first augment LLaMA-Adapter by unlocking more learnable parameters (e.g., norm, bias and scale), which distribute the instruction-following ability across the entire LLaMA model besides adapters. Secondly, we propose an early fusion strategy to feed visual tokens only into the early LLM layers, contributing to better visual knowledge incorporation. Thirdly, a joint training paradigm of image-text pairs and instruction-following data is introduced by optimizing disjoint groups of learnable parameters. This strategy effectively alleviates the interference between the two tasks of image-text alignment and instruction following and achieves strong multi-modal reasoning with only a small-scale image-text and instruction dataset. During inference, we incorporate additional expert models (e.g. captioning/OCR systems) into LLaMA-Adapter to further enhance its image understanding capability without incurring training costs. Compared to the original LLaMA-Adapter, our LLaMA-Adapter V2 can perform open-ended multi-modal instructions by merely introducing 14M parameters over LLaMA. The newly designed framework also exhibits stronger language-only instruction-following capabilities and even excels in chat interactions. Our code and models are available at https://github.com/ZrrSkywalker/LLaMA-Adapter.
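The abstract's second and third ingredients, unlocking norm/bias/scale parameters and optimizing disjoint parameter groups on the two data sources, can be pictured with a short sketch. Everything below, including the substring matching on parameter names and the Hugging Face-style `model(**batch).loss` call, is an assumption for illustration rather than the authors' implementation.

```python
import torch

def split_param_groups(model):
    """Sketch of the disjoint split:
    group A (image-text pairs)  -> visual projection and gating parameters
    group B (instruction data)  -> unlocked norms, biases, and scales in the LLM."""
    group_a, group_b = [], []
    for name, p in model.named_parameters():
        p.requires_grad = False
        if "visual_proj" in name or "gate" in name:             # hypothetical module names
            p.requires_grad = True
            group_a.append(p)
        elif any(k in name for k in ("norm", "bias", "scale")):
            p.requires_grad = True
            group_b.append(p)
    return group_a, group_b

def joint_training_step(model, caption_batch, instruct_batch, opt_a, opt_b):
    """Alternate the two objectives; each optimizer only ever updates its own group."""
    model.zero_grad(set_to_none=True)
    model(**caption_batch).loss.backward()    # image-text alignment
    opt_a.step()

    model.zero_grad(set_to_none=True)
    model(**instruct_batch).loss.backward()   # language instruction following
    opt_b.step()
```

With `opt_a = torch.optim.AdamW(group_a)` and `opt_b = torch.optim.AdamW(group_b)`, a gradient from one objective never moves the other objective's parameters; that separation is the interference-avoidance argument in miniature.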
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces LLaMA-Adapter V2 as a parameter-efficient extension of the original LLaMA-Adapter for visual instruction following. It augments the base model by unlocking additional learnable parameters (norms, biases, scales) across LLaMA layers, introduces an early-fusion strategy that injects visual tokens only into early LLM layers, proposes joint training on image-text pairs and instruction data using disjoint groups of learnable parameters to reduce task interference, and incorporates external expert models (captioning/OCR) at inference time. The central claim is that these changes enable open-ended multi-modal instruction following with only 14M added parameters over LLaMA while also strengthening language-only capabilities.
Significance. If the empirical results hold, the work demonstrates a lightweight route to multi-modal instruction models that avoids full fine-tuning of large LLMs. The combination of parameter unlocking, early fusion, and disjoint-parameter joint training addresses a practical bottleneck in scaling visual instruction tuning, and the inference-time expert integration provides a zero-training-cost way to boost image understanding. These elements could influence subsequent adapter-based multi-modal systems if the interference-reduction mechanism is shown to generalize.
major comments (2)
- [Abstract and §3, joint training description] The claim that disjoint-parameter optimization 'effectively alleviates the interference' between image-text alignment and instruction following is load-bearing for the 14M-parameter performance claim, yet no ablation comparing shared versus disjoint optimization on the same data mixture is reported, nor any diagnostic such as gradient similarity, per-task loss curves, or a performance delta. Without this measurement the contribution of the disjoint split cannot be isolated from the effects of unlocked norms/biases or early fusion (a minimal diagnostic sketch follows these major comments).
- [§4, experimental results] The headline comparison to the original LLaMA-Adapter and to GPT-4-style models rests on quantitative tables that are referenced but not shown in the provided text; without error bars, statistical significance tests, or per-task breakdowns it is impossible to assess whether the reported gains are robust or driven by particular data subsets.
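One way to supply the diagnostic asked for in the first comment is to measure how aligned the two tasks' gradients are on a shared parameter subset. The sketch below assumes the same hypothetical Hugging Face-style `model(**batch).loss` interface as above; it is illustrative, not an existing LLaMA-Adapter utility.

```python
import torch
import torch.nn.functional as F

def task_gradient_cosine(model, caption_batch, instruct_batch, shared_params):
    """Cosine similarity between per-task gradients on `shared_params`.
    Values near or below zero would indicate the interference the disjoint split is meant to avoid."""
    def flat_grad(batch):
        model.zero_grad(set_to_none=True)
        model(**batch).loss.backward()
        return torch.cat([p.grad.reshape(-1) for p in shared_params if p.grad is not None])

    g_align = flat_grad(caption_batch)    # image-text alignment gradient
    g_instr = flat_grad(instruct_batch)   # instruction-following gradient
    return F.cosine_similarity(g_align, g_instr, dim=0).item()
```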
minor comments (2)
- [§2] The exact count and placement of the unlocked norm/bias/scale parameters should be stated explicitly (e.g., which layers and which modules) rather than left as 'e.g., norm, bias and scale'.
- [Figures] Figure captions and axis labels in the (referenced) ablation and comparison figures should include the precise training data sizes and hyper-parameter settings used for each variant to allow direct reproduction.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript to strengthen the presentation of our contributions.
Point-by-point responses
-
Referee: [Abstract and §3, joint training description] The claim that disjoint-parameter optimization 'effectively alleviates the interference' between image-text alignment and instruction following is load-bearing for the 14M-parameter performance claim, yet no ablation comparing shared versus disjoint optimization on the same data mixture is reported, nor any diagnostic such as gradient similarity, per-task loss curves, or a performance delta. Without this measurement the contribution of the disjoint split cannot be isolated from the effects of unlocked norms/biases or early fusion.
Authors: We agree that an explicit ablation isolating the disjoint-parameter strategy would strengthen the claim. Our joint training results demonstrate that optimizing disjoint parameter groups on image-text pairs and instruction data yields strong multi-modal performance with only 14M added parameters, outperforming the original LLaMA-Adapter. In the revision we will add an ablation comparing shared versus disjoint optimization on the same data mixture, including performance deltas and task-specific metrics to isolate the interference-reduction effect. revision: yes
-
Referee: [§4, experimental results] The headline comparison to the original LLaMA-Adapter and to GPT-4-style models rests on quantitative tables that are referenced but not shown in the provided text; without error bars, statistical significance tests, or per-task breakdowns it is impossible to assess whether the reported gains are robust or driven by particular data subsets.
Authors: The full manuscript contains Tables 1-4 with quantitative comparisons on VQA, captioning, and instruction-following benchmarks. We acknowledge that error bars, significance tests, and per-task breakdowns would improve assessment of robustness. In the revision we will add error bars for multi-seed experiments, report statistical significance where applicable, and include expanded per-task breakdowns in the main text or supplementary material. revision: partial
Circularity Check
No circularity: empirical architecture and training choices evaluated against external benchmarks
Full rationale
The paper describes three design choices (unlocking norm/bias/scale parameters, early visual fusion, and disjoint-parameter joint training on image-text plus instruction data) followed by empirical evaluation against the prior LLaMA-Adapter and external models. No equations, first-principles derivations, or predictions appear in the provided text. The 14M-parameter claim is a direct count of added learnable weights, not a fitted or self-referential quantity. Self-citation to the original LLaMA-Adapter is present but functions only as a baseline comparison, not as load-bearing justification for any uniqueness theorem or ansatz. The joint-training interference claim is asserted without an internal ablation, but this is an evidence gap rather than circularity; the reported results remain falsifiable against held-out benchmarks. No step reduces by construction to its own inputs.
Axiom & Free-Parameter Ledger
axioms (2)
- standard math: Transformer layers with standard attention and feed-forward blocks can be adapted via low-rank or partial-parameter updates while preserving core language capabilities (a counting sketch follows this ledger).
- domain assumption: Visual tokens can be treated as additional sequence elements that the language model can attend to without architectural redesign.
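As a rough check on the first axiom and the 14M figure, one can freeze a LLaMA-style model, re-enable only norm, bias, and scale tensors, and count what remains trainable. The sketch assumes Hugging Face-style `named_parameters()` and substring matching on parameter names; note that the released LLaMA weights carry norm weights but no bias or scale terms on the linear layers, so V2's added tensors would have to be inserted before this count approaches the paper's figure.

```python
def unlock_and_count(model):
    """Freeze everything, re-enable only norm/bias/scale tensors, and report the budget."""
    total, trainable = 0, 0
    for name, p in model.named_parameters():
        total += p.numel()
        p.requires_grad = any(key in name for key in ("norm", "bias", "scale"))
        if p.requires_grad:
            trainable += p.numel()
    print(f"trainable: {trainable / 1e6:.1f}M of {total / 1e6:.1f}M "
          f"({100 * trainable / total:.3f}%)")
    return trainable
```

On a 7B-parameter base, 14M trainable weights is roughly 0.2% of the model, which is the sense in which the method is parameter-efficient.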
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation : washburn_uniqueness_aczel
Tag: unclear. The relation between the paper passage and the cited Recognition theorem is ambiguous.
a joint training paradigm of image-text pairs and instruction-following data is introduced by optimizing disjoint groups of learnable parameters. This strategy effectively alleviates the interference between the two tasks
-
IndisputableMonolith/Foundation/RealityFromDistinction : reality_from_one_distinction
Tag: unclear. The relation between the paper passage and the cited Recognition theorem is ambiguous.
we propose an early fusion strategy to feed visual tokens only into the early LLM layers
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 20 Pith papers
-
MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark
MMMU-Pro is a stricter multimodal benchmark that removes text-only solvable questions, augments options, and requires reading text from images, yielding substantially lower model scores of 16.8-26.9%.
-
MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI
MMMU provides 11.5K heterogeneous college-level multimodal questions that current models solve at 56-59% accuracy, establishing a new standard for expert multimodal evaluation.
-
Q-Align: Teaching LMMs for Visual Scoring via Discrete Text-Defined Levels
Q-Align trains LMMs on discrete text-defined levels for visual scoring, achieving SOTA on IQA, IAA, and VQA while unifying the tasks in OneAlign.
-
SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension
SEED-Bench is a new benchmark of 19K multiple-choice questions for evaluating generative comprehension in multimodal LLMs across 12 image and video dimensions.
-
Evaluating Object Hallucination in Large Vision-Language Models
Large vision-language models exhibit severe object hallucination that varies with training instructions, and the proposed POPE polling method evaluates it more stably and flexibly than prior approaches.
-
LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention
LLaMA-Adapter turns frozen LLaMA 7B into a capable instruction follower using only 1.2M new parameters and zero-init attention, matching Alpaca while extending to image-conditioned reasoning on ScienceQA and COCO.
-
Latent Denoising Improves Visual Alignment in Large Multimodal Models
A latent denoising objective with saliency-aware corruption and contrastive distillation improves visual alignment and corruption robustness in large multimodal models.
-
Symbiotic-MoE: Unlocking the Synergy between Generation and Understanding
Symbiotic-MoE introduces modality-aware expert disentanglement and progressive training in a multimodal MoE to achieve synergistic generation and understanding without task interference or extra parameters.
-
Chat-Scene++: Exploiting Context-Rich Object Identification for 3D LLM
Chat-Scene++ improves 3D scene understanding in multimodal LLMs by representing scenes as context-rich object sequences with identifier tokens and grounded chain-of-thought reasoning, reaching state-of-the-art on five...
-
Video-LLaVA: Learning United Visual Representation by Alignment Before Projection
Video-LLaVA creates a unified visual representation for images and videos via pre-projection alignment, enabling mutual enhancement from joint training and strong results on image and video benchmarks.
-
IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models
IP-Adapter adds effective image prompting to text-to-image diffusion models using a lightweight decoupled cross-attention adapter that works alongside text prompts and other controls.
-
MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models
MME is a manually annotated benchmark evaluating MLLMs on perception and cognition across 14 subtasks to avoid data leakage and support fair model comparisons.
-
Otter: A Multi-Modal Model with In-Context Instruction Tuning
Otter is a multi-modal model instruction-tuned on the MIMIC-IT dataset of over 3 million in-context instruction-response pairs to improve convergence and generalization on tasks with multiple images and videos.
-
Adaptor: Advancing Assistive Teleoperation with Few-Shot Learning and Cross-Operator Generalization
Adaptor uses few-shot learning with trajectory perturbation and vision-language conditioning to achieve robust cross-operator intent recognition and higher success rates in assistive teleoperation.
-
Hallucination of Multimodal Large Language Models: A Survey
The survey organizes causes of hallucinations in MLLMs, reviews evaluation benchmarks and metrics, and outlines mitigation approaches plus open questions.
-
InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks
InternVL scales a vision model to 6B parameters and aligns it with LLMs using web data to achieve state-of-the-art results on 32 visual-linguistic benchmarks.
-
Empowering Video Translation using Multimodal Large Language Models
The paper offers the first focused review of MLLM-based video translation organized by a three-role taxonomy of Semantic Reasoner, Expressive Performer, and Visual Synthesizer, plus open challenges.
-
How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites
InternVL 1.5 narrows the performance gap to proprietary multimodal models via a stronger transferable vision encoder, dynamic high-resolution tiling, and curated English-Chinese training data.
-
Parameter-Efficient Fine-Tuning for Large Models: A Comprehensive Survey
A comprehensive survey of PEFT algorithms for large models, covering their performance, overhead, applications, and real-world system implementations.
-
A Survey on Hallucination in Large Vision-Language Models
This survey reviews the definition, symptoms, evaluation benchmarks, root causes, and mitigation methods for hallucinations in large vision-language models.