pith. machine review for the scientific record.

arxiv: 2305.16355 · v1 · submitted 2023-05-25 · 💻 cs.CL · cs.CV

Recognition: 2 theorem links

PandaGPT: One Model To Instruction-Follow Them All

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 08:58 UTC · model grok-4.3

classification 💻 cs.CL cs.CV
keywords multimodal instruction following · cross-modal composition · zero-shot multimodal · unified embeddings · large language models · ImageBind · emergent capabilities

The pith

A single model trained only on image-text pairs can follow instructions on video, audio, depth, and thermal inputs by composing their meanings in a shared embedding space.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows how to connect a multimodal encoder that unifies many data types with a large language model so that instruction following emerges for inputs never seen together during training. This matters because current multimodal systems usually demand separate large datasets for every new sense such as sound or motion, which are expensive to collect. If the approach works, one training regime on ordinary image captions can unlock natural composition across senses, for example linking how an object looks in a video with how it sounds in an audio clip. A sympathetic reader would see this as a practical shortcut toward AI that perceives the world more like humans do, using one shared space rather than many separate alignments.

Core claim

PandaGPT combines the multimodal encoders from ImageBind and the large language models from Vicuna. Only aligned image-text pairs are required for training, yet the system displays emergent zero-shot cross-modal behaviors for data other than image and text, including video, audio, depth, thermal, and IMU inputs. It can take multimodal inputs simultaneously and compose their semantics naturally, such as connecting visual appearance with auditory properties.

What carries the argument

ImageBind's unified embedding space, which maps inputs from different modalities into the same vector space so the language model can treat them as interchangeable tokens during instruction following.
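
The wiring this premise describes can be sketched in a few lines. Everything below is a toy stand-in: the dimensions, the encoder output, and the projection weights are invented for illustration, not taken from the paper.

```python
# Toy sketch of the PandaGPT-style wiring: a frozen multimodal encoder maps
# any input into a shared embedding space, a small learned linear projection
# maps that vector into the LLM's token-embedding space, and the result is
# prepended to the prompt as a "soft token". All values are stand-ins.
import random

EMBED_DIM = 4    # shared-space dimension (toy value, not ImageBind's)
LLM_DIM = 6      # LLM token-embedding dimension (toy value, not Vicuna's)

# The only trained component: a projection matrix W of shape LLM_DIM x EMBED_DIM.
random.seed(0)
W = [[random.uniform(-1, 1) for _ in range(EMBED_DIM)] for _ in range(LLM_DIM)]

def project(shared_vec):
    """Map a shared-space embedding to an LLM soft token (matrix-vector product)."""
    return [sum(w * x for w, x in zip(row, shared_vec)) for row in W]

def build_llm_input(shared_vec, prompt_token_embeds):
    """Prepend the projected modality embedding to the prompt embeddings."""
    return [project(shared_vec)] + prompt_token_embeds

# Because every modality lands in the same shared space, the one projection
# trained on image-text pairs also serves audio, video, depth, thermal, or IMU.
audio_embedding = [0.1, -0.3, 0.8, 0.2]       # stand-in encoder output
prompt = [[0.0] * LLM_DIM, [1.0] * LLM_DIM]   # stand-in prompt token embeddings
llm_input = build_llm_input(audio_embedding, prompt)
```

The design choice the sketch highlights: only `W` is learned, so the claimed zero-shot transfer stands or falls with how interchangeable the shared-space vectors really are across modalities.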

If this is right

  • Complex instruction tasks such as detailed image description, writing stories from video, and answering questions about audio become possible with one model.
  • Multimodal inputs presented at the same time can be processed by naturally composing their separate meanings.
  • New modalities like depth maps or thermal images receive instruction-following ability without any dedicated training data for them.
  • The same training recipe can be reused across future unified encoders to extend coverage to additional senses.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • If the unified space proves sufficient, the cost of building broad multimodal systems drops sharply because only one modality pair needs alignment data.
  • The same mechanism could be tested on sensor streams from robots, where visual, audio, and inertial measurements must be interpreted together.
  • Future work could measure how much of the cross-modal ability survives when a different unified encoder replaces ImageBind.
  • The pattern suggests that instruction tuning on top of a rich shared space may be enough for many perception tasks, reducing the need for modality-specific fine-tuning.

Load-bearing premise

ImageBind's embedding space already contains enough semantic structure for the language model to compose meanings across modalities without any extra alignment training on those modalities.

What would settle it

A test in which PandaGPT is given paired image and audio inputs and asked to answer a question that requires combining visual and auditory information, such as identifying an object by both its appearance and its sound; consistent failure on such items would show the claimed cross-modal composition does not occur.
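
A hedged toy version of that test, with hand-picked vectors standing in for ImageBind embeddings: compose the image and audio vectors (here by a naive elementwise mean) and check whether the composite lands nearer the correct joint concept than a distractor.

```python
# Toy sketch of the settling test: compose an image embedding and an audio
# embedding in a shared space and check whether the composite is closer to
# the correct joint concept than to a distractor. Vectors are invented.
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def compose(u, v):
    """Naive semantic composition: elementwise mean of two shared-space vectors."""
    return [(a + b) / 2 for a, b in zip(u, v)]

dog_image = [0.9, 0.1, 0.0]      # "looks like a dog" (stand-in)
bark_audio = [0.8, 0.0, 0.2]     # "sounds like barking" (stand-in)
dog_concept = [0.85, 0.05, 0.1]  # joint concept "a barking dog" (stand-in)
cat_concept = [0.1, 0.9, 0.1]    # distractor (stand-in)

composite = compose(dog_image, bark_audio)
answer = "dog" if cosine(composite, dog_concept) > cosine(composite, cat_concept) else "cat"
```

On real items, consistent selection of the distractor would be the falsifying outcome the paragraph above describes.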

read the original abstract

We present PandaGPT, an approach to emPower large lANguage moDels with visual and Auditory instruction-following capabilities. Our pilot experiments show that PandaGPT can perform complex tasks such as detailed image description generation, writing stories inspired by videos, and answering questions about audios. More interestingly, PandaGPT can take multimodal inputs simultaneously and compose their semantics naturally. For example, PandaGPT can connect how objects look in an image/video and how they sound in an audio. To do so, PandaGPT combines the multimodal encoders from ImageBind and the large language models from Vicuna. Notably, only aligned image-text pairs are required for the training of PandaGPT. Thanks to the strong capability of ImageBind in embedding data from different modalities into the same space, PandaGPT displays emergent, i.e. zero-shot, cross-modal behaviors for data other than image and text (e.g., video, audio, depth, thermal, and IMU). We hope that PandaGPT serves as an initial step toward building AGI that can perceive and understand inputs in different modalities holistically, as we humans do. Our project page is at https://panda-gpt.github.io/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper introduces PandaGPT, which combines the ImageBind multimodal encoder with the Vicuna LLM by training only a lightweight projection layer on aligned image-text pairs. It claims that this yields instruction-following capabilities on images and audio, plus emergent zero-shot cross-modal behaviors (e.g., composing semantics across video, depth, thermal, and IMU inputs) and natural handling of simultaneous multimodal inputs, all without any training data from non-image modalities.

Significance. If the emergent cross-modal composition claims hold under quantitative scrutiny, the approach would demonstrate an unusually data-efficient route to multimodal instruction following by exploiting pre-aligned embedding spaces. This could reduce the cost of extending LLMs beyond vision-language pairs and would be of broad interest to the multimodal learning community.

major comments (3)
  1. [Abstract] The central claims of 'emergent, i.e. zero-shot, cross-modal behaviors' and 'compose their semantics naturally' rest exclusively on qualitative pilot demonstrations; no quantitative metrics, baselines, retrieval accuracies, or error analysis are reported, so the scope of the behaviors cannot be verified.
  2. [Method/Experiments] Training occurs solely on ImageBind image embeddings paired with Vicuna; the zero-shot transfer to audio, video, depth, etc. therefore depends on the untested uniformity of ImageBind's cross-modal alignment, yet no ablation replaces ImageBind with a non-aligned encoder of matching dimensionality or reports any cross-modal task delta.
  3. [Experiments] The reported tasks (detailed image description, video-inspired stories, audio question answering) are illustrated only by selected examples, without controls for prompt sensitivity or output variability and without comparison against unimodal baselines or separate modality-specific models.
minor comments (1)
  1. [Abstract] The project page URL is given but no quantitative results or failure cases are linked from the paper itself.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We are grateful to the referee for the detailed and constructive comments. We believe our work demonstrates a promising direction for multimodal instruction following with minimal training data. Below we address each major comment point by point, and we will incorporate revisions accordingly.

read point-by-point responses
  1. Referee: [Abstract] The central claims of 'emergent, i.e. zero-shot, cross-modal behaviors' and 'compose their semantics naturally' rest exclusively on qualitative pilot demonstrations; no quantitative metrics, baselines, retrieval accuracies, or error analysis are reported, so the scope of the behaviors cannot be verified.

    Authors: We acknowledge that the claims in the abstract are based on qualitative pilot demonstrations. This is because the work is positioned as an initial exploration of the approach. In the revised manuscript, we will expand the experiments section to include quantitative evaluations, such as human preference studies or accuracy metrics on specific tasks like audio question answering, along with error analysis to better verify the scope of the emergent behaviors. revision: yes

  2. Referee: [Method/Experiments] Training occurs solely on ImageBind image embeddings paired with Vicuna; the zero-shot transfer to audio, video, depth, etc. therefore depends on the untested uniformity of ImageBind's cross-modal alignment, yet no ablation replaces ImageBind with a non-aligned encoder of matching dimensionality or reports any cross-modal task delta.

    Authors: We agree that the effectiveness depends on ImageBind's alignment quality. We chose not to include ablations with non-aligned encoders because the method specifically leverages the shared embedding space provided by ImageBind. Replacing it would fundamentally change the approach and likely require additional training data. However, we will add a new subsection discussing the role of pre-aligned embeddings and cite supporting literature on cross-modal alignment to strengthen this point. revision: partial

  3. Referee: [Experiments] The reported tasks (detailed image description, video-inspired stories, audio question answering) are illustrated only by selected examples, without controls for prompt sensitivity or output variability and without comparison against unimodal baselines or separate modality-specific models.

    Authors: The current experiments focus on qualitative demonstrations to showcase the capabilities. To address the concern, we will include additional examples with varied prompts to illustrate robustness, discuss output variability in the text, and add comparisons to unimodal baselines (e.g., text-only Vicuna or image-only models) where feasible in the revised version. revision: yes

Circularity Check

0 steps flagged

No significant circularity; claims rely on external pre-trained models without internal reduction

full rationale

The paper trains a projection layer exclusively on aligned image-text pairs from ImageBind and Vicuna, then applies the same projected space to embeddings from other modalities (video, audio, etc.) to claim zero-shot cross-modal composition. No equation, parameter fit, or derivation within the paper reduces by construction to its own inputs or to a self-citation chain. The load-bearing premise—that ImageBind already produces a sufficiently uniform semantic space—is imported from the external ImageBind work rather than derived or fitted inside PandaGPT. This is a standard empirical reliance on pre-trained components and does not constitute circularity under the defined patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the pre-existing alignment quality of ImageBind and the instruction-following capacity of Vicuna; no new free parameters or invented entities are introduced in the abstract.

axioms (1)
  • domain assumption ImageBind produces a shared embedding space in which vectors from different modalities can be composed by a language model without modality-specific fine-tuning.
    Invoked to explain zero-shot transfer to video, audio, depth, thermal, and IMU inputs.
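
One way to probe this single axiom, sketched here with invented toy vectors rather than real ImageBind outputs: paired cross-modal embeddings should beat mismatched pairs by a clear cosine margin in every modality, and a modality whose margin collapses would mark where the assumption fails.

```python
# Toy sketch of checking the domain assumption: per modality, compare the
# cosine similarity of paired cross-modal embeddings against a mismatched
# pair. All vectors and the 0.1 margin threshold are invented stand-ins.
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Hypothetical (image, other-modality) embedding pairs for matched content.
pairs = {
    "audio": ([1.0, 0.2, 0.1], [0.9, 0.3, 0.0]),
    "depth": ([0.1, 1.0, 0.3], [0.2, 0.9, 0.2]),
}
# A mismatched pair: an image embedding against the wrong depth embedding.
unpaired = ([1.0, 0.2, 0.1], [0.2, 0.9, 0.2])

margins = {m: cosine(a, b) - cosine(*unpaired) for m, (a, b) in pairs.items()}
aligned_uniformly = all(margin > 0.1 for margin in margins.values())
```

On real embeddings, reporting these margins per modality would turn the ledger's axiom into a measured quantity rather than an import.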

pith-pipeline@v0.9.0 · 5517 in / 1200 out tokens · 34882 ms · 2026-05-16T08:58:06.902090+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith.Foundation.LogicAsFunctionalEquation RCL_is_unique_functional_form_of_logic · echoes

    ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

    PandaGPT combines the multimodal encoders from ImageBind and the large language models from Vicuna... Thanks to the strong capability of ImageBind in embedding data from different modalities into the same space, PandaGPT displays emergent, i.e. zero-shot, cross-modal behaviors

  • IndisputableMonolith.Foundation.DAlembert.Inevitability bilinear_family_forced · echoes

    More interestingly, PandaGPT can take multimodal inputs simultaneously and compose their semantics naturally

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 18 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Cross-Modal Backdoors in Multimodal Large Language Models

    cs.CR 2026-05 unverdicted novelty 8.0

    Poisoning a single connector in MLLMs establishes a reusable latent backdoor pathway that transfers across modalities with over 95% attack success rate under bounded perturbations.

  2. Do Audio-Visual Large Language Models Really See and Hear?

    cs.AI 2026-04 unverdicted novelty 8.0

    AVLLMs encode audio semantics in middle layers but suppress them in final text outputs when audio conflicts with vision, due to training that largely inherits from vision-language base models.

  3. SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension

    cs.CL 2023-07 unverdicted novelty 7.0

    SEED-Bench is a new benchmark of 19K multiple-choice questions for evaluating generative comprehension in multimodal LLMs across 12 image and video dimensions.

  4. A Systematic Study of Cross-Modal Typographic Attacks on Audio-Visual Reasoning

    cs.CV 2026-04 unverdicted novelty 6.0

    Coordinated multi-modal typographic attacks on MLLMs achieve 83.43% success rate versus 34.93% for single-modality attacks.

  5. Feeling the Space: Egomotion-Aware Video Representation for Efficient and Accurate 3D Scene Understanding

    cs.CV 2026-03 unverdicted novelty 6.0

    Motion-MLLM integrates IMU egomotion data into MLLMs using cascaded filtering and asymmetric fusion to ground visual content in physical trajectories for scale-aware 3D understanding, achieving competitive accuracy at...

  6. C2F-Thinker: Coarse-to-Fine Reasoning with Hint-Guided Reinforcement Learning for Multimodal Sentiment Analysis

    cs.CL 2026-03 unverdicted novelty 6.0

    C2F-Thinker combines structured coarse-to-fine chain-of-thought reasoning with hint-guided GRPO reinforcement learning to achieve competitive fine-grained sentiment regression and superior cross-domain generalization ...

  7. Spatial-MLLM: Boosting MLLM Capabilities in Visual-based Spatial Intelligence

    cs.CV 2025-05 unverdicted novelty 6.0

    Spatial-MLLM boosts MLLM spatial intelligence from 2D inputs via dual encoders initialized from geometry models plus space-aware sampling, claiming state-of-the-art results.

  8. MMBench: Is Your Multi-modal Model an All-around Player?

    cs.CV 2023-07 accept novelty 6.0

    MMBench is a new bilingual benchmark that uses curated questions, CircularEval, and LLM-assisted answer conversion to provide objective, fine-grained evaluation of vision-language models.

  9. MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models

    cs.CV 2023-06 unverdicted novelty 6.0

    MME is a manually annotated benchmark evaluating MLLMs on perception and cognition across 14 subtasks to avoid data leakage and support fair model comparisons.

  10. AffectAgent: Collaborative Multi-Agent Reasoning for Retrieval-Augmented Multimodal Emotion Recognition

    cs.CV 2026-04 unverdicted novelty 5.0

    AffectAgent deploys a query planner, evidence filter, and emotion generator as collaborative agents trained via MAPPO with shared reward, plus MB-MoE and RAAF modules, to achieve superior multimodal emotion recognitio...

  11. Hallucination of Multimodal Large Language Models: A Survey

    cs.CV 2024-04 accept novelty 5.0

    The survey organizes causes of hallucinations in MLLMs, reviews evaluation benchmarks and metrics, and outlines mitigation approaches plus open questions.

  12. ClimateVID -- Social Media Videos Analysis and Challenges Involved

    cs.CV 2026-04 unverdicted novelty 4.0

    Vision-language models fail at zero-shot detection of climate-specific classes in social media videos, while DINOv2 and ConvNeXt V2 embeddings yield meaningful clusters via minimum-cost multicut.

  13. Empowering Video Translation using Multimodal Large Language Models

    cs.CV 2026-04 unverdicted novelty 4.0

    The paper offers the first focused review of MLLM-based video translation organized by a three-role taxonomy of Semantic Reasoner, Expressive Performer, and Visual Synthesizer, plus open challenges.

  14. VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding

    cs.CV 2025-01 unverdicted novelty 4.0

    VideoLLaMA3 uses a vision-centric training paradigm and token-reduction design to reach competitive results on image and video benchmarks.

  15. Qwen2-Audio Technical Report

    eess.AS 2024-07 unverdicted novelty 4.0

    Qwen2-Audio is an open-source audio-language model that outperforms prior systems such as Gemini-1.5-pro on audio-centric instruction-following benchmarks after simplified prompt-based pre-training and expanded data.

  16. VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs

    cs.CV 2024-06 unverdicted novelty 4.0

    VideoLLaMA 2 improves video LLMs via a new STC connector for spatial-temporal dynamics and joint audio training, reaching competitive results on video QA and captioning benchmarks.

  17. The Rise and Potential of Large Language Model Based Agents: A Survey

    cs.AI 2023-09 accept novelty 4.0

    The paper surveys the origins, frameworks, applications, and open challenges of AI agents built on large language models.

  18. A Survey on Multimodal Large Language Models

    cs.CV 2023-06 accept novelty 3.0

    This survey organizes the architectures, training strategies, data, evaluation methods, extensions, and challenges of Multimodal Large Language Models.

Reference graph

Works this paper leans on

32 extracted references · 32 canonical work pages · cited by 18 Pith papers · 8 internal anchors

  1. [1]

    Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. 2022. Flamingo: a visual language model for few-shot learning. Advances in Neural Information Processing Systems, 35:23716–23736

  2. [2]

    Jean-Baptiste Alayrac, Adria Recasens, Rosalia Schneider, Relja Arandjelović, Jason Ramapuram, Jeffrey De Fauw, Lucas Smaira, Sander Dieleman, and Andrew Zisserman. 2020. Self-supervised multimodal versatile networks. Advances in Neural Information Processing Systems, 33:25–37

  3. [3]

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901

  4. [4]

    Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. 2023. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality

  5. [5]

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186

  6. [6]

    Fartash Faghri, David J Fleet, Jamie Ryan Kiros, and Sanja Fidler. 2017. Vse++: Improving visual-semantic embeddings with hard negatives. arXiv preprint arXiv:1707.05612

  7. [7]

    Andrea Frome, Greg S Corrado, Jon Shlens, Samy Bengio, Jeff Dean, Marc’Aurelio Ranzato, and Tomas Mikolov. 2013. Devise: A deep visual-semantic embedding model. Advances in neural information processing systems, 26

  8. [8]

    Rohit Girdhar, Alaaeldin El-Nouby, Zhuang Liu, Mannat Singh, Kalyan Vasudev Alwala, Armand Joulin, and Ishan Misra. 2023. Imagebind: One embedding space to bind them all. arXiv preprint arXiv:2305.05665

  9. [9]

    Andrey Guzhov, Federico Raue, Jörn Hees, and Andreas Dengel. 2022. Audioclip: Extending clip to image, text and audio. In ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 976–980. IEEE

  10. [10]

    Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2021. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685

  11. [11]

    Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. 2023. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597

  12. [12]

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. 2023. Visual instruction tuning

  13. [13]

    Swaroop Mishra, Daniel Khashabi, Chitta Baral, and Hannaneh Hajishirzi. 2021. Natural instructions: Benchmarking generalization to new tasks from natural language instructions. arXiv preprint arXiv:2104.08773, pages 839–849

  14. [14]

    OpenAI. 2022. Gpt-4 technical report

  15. [15]

    OpenAI. 2022. Introducing chatgpt

  16. [16]

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. 2022. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744

  17. [17]

    Renjie Pi, Jiahui Gao, Shizhe Diao, Rui Pan, Hanze Dong, Jipeng Zhang, Lewei Yao, Lingpeng Kong, and Tong Zhang. 2023. Detgpt: Detect what you need via reasoning

  18. [18]

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. 2021. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR

  19. [19]

    Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever, et al. 2018. Improving language understanding by generative pre-training

  20. [20]

    Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. 2019. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9

  21. [21]

    Victor Sanh, Albert Webson, Colin Raffel, Stephen H Bach, Lintang Sutawika, Zaid Alyafeai, Antoine Chaffin, Arnaud Stiegler, Teven Le Scao, Arun Raja, et al. 2021. Multitask prompted training enables zero-shot task generalization. arXiv preprint arXiv:2110.08207

  22. [22]

    Teven Le Scao, Angela Fan, Christopher Akiki, Ellie Pavlick, Suzana Ilić, Daniel Hesslow, Roman Castagné, Alexandra Sasha Luccioni, François Yvon, Matthias Gallé, et al. 2022. Bloom: A 176b-parameter open-access multilingual language model. arXiv preprint arXiv:2211.05100

  23. [23]

    Nisan Stiennon, Long Ouyang, Jeffrey Wu, Daniel Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, and Paul F Christiano. 2020. Learning to summarize with human feedback. Advances in Neural Information Processing Systems, 33:3008–3021

  24. [24]

    Yixuan Su, Tian Lan, and Deng Cai. 2023. Openalpaca: A fully open-source instruction-following model based on openllama. https://github.com/yxuansu/OpenAlpaca

  25. [25]

    Yixuan Su, Tian Lan, Yahui Liu, Fangyu Liu, Dani Yogatama, Yan Wang, Lingpeng Kong, and Nigel Collier. 2022. Language models can see: plugging visual controls in text generation. arXiv preprint arXiv:2205.02655

  26. [26]

    Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. 2023. Stanford alpaca: An instruction-following llama model. https://github.com/tatsu-lab/stanford_alpaca

  27. [27]

    Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. 2023. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971

  28. [28]

    Jason Wei, Maarten Bosma, Vincent Y Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M Dai, and Quoc V Le. 2021. Finetuned language models are zero-shot learners. arXiv preprint arXiv:2109.01652

  29. [29]

    Dong Zhang, Shimin Li, Xin Zhang, Jun Zhan, Pengyu Wang, Yaqian Zhou, and Xipeng Qiu. 2023. Speechgpt: Empowering large language models with intrinsic cross-modal conversational abilities

  31. [31]

    Hang Zhang, Xin Li, and Lidong Bing. 2023. Video-llama: An instruction-finetuned visual language model for video understanding

  32. [32]

    Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. 2023. Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592