Pith · machine review for the scientific record

arXiv: 2311.10122 · v3 · submitted 2023-11-16 · 💻 cs.CV

Recognition: 1 Lean theorem link

Video-LLaVA: Learning United Visual Representation by Alignment Before Projection

Bin Lin, Bin Zhu, Jiaxi Cui, Li Yuan, Munan Ning, Peng Jin, Yang Ye

Authors on Pith: no claims yet

Pith reviewed 2026-05-14 18:00 UTC · model grok-4.3

classification: 💻 cs.CV
keywords: unified visual representation · large vision-language model · image-video alignment · multi-modal LLM · video understanding · mutual enhancement

The pith

By aligning images and videos into the language feature space before projection, a single LLM processes both modalities and lets them improve each other.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper identifies misalignment of image and video features before projection as the core obstacle preventing an LLM from learning joint multi-modal interactions. It shows that first mapping both into the same language feature space removes this barrier and allows training on a combined image-video dataset. The resulting Video-LLaVA model then exhibits mutual gains: image data helps video understanding and video data helps image understanding. This produces a simple baseline that beats prior specialized systems on nine image benchmarks and on four video datasets.

Core claim

We unify visual representation into the language feature space to advance the foundational LLM towards a unified LVLM. As a result, we establish a simple but robust LVLM baseline, Video-LLaVA, which learns from a mixed dataset of images and videos, mutually enhancing each other. Video-LLaVA achieves superior performances on a broad range of 9 image benchmarks across 5 image question-answering datasets and 4 image benchmark toolkits, and outperforms Video-ChatGPT by 5.8 percent, 9.9 percent, 18.6 percent, and 10.1 percent on MSRVTT, MSVD, TGIF, and ActivityNet respectively.

What carries the argument

Alignment before projection: the step that places image and video features into a common language feature space prior to the LLM projection layers, so that a single model can learn from mixed data.
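A minimal sketch of what such a front-end could look like, assuming LanguageBind-style encoders whose image and video features already live in the same language-aligned space, followed by one shared projector into the LLM embedding space. Class names, dimensions, and the projector shape are illustrative assumptions, not the authors' code.

```python
# Sketch of "alignment before projection": both encoders emit features that are
# already aligned to the language feature space, so one shared projector can
# feed the LLM with mixed image/video batches. Dimensions are assumptions.
import torch
import torch.nn as nn

class UnifiedVisualFrontEnd(nn.Module):
    def __init__(self, image_encoder, video_encoder, vis_dim=1024, llm_dim=4096):
        super().__init__()
        # Assumed: both encoders are pre-aligned to the same language space
        # (e.g., LanguageBind-style initialization), so their outputs are comparable.
        self.image_encoder = image_encoder
        self.video_encoder = video_encoder
        # A single projector shared across modalities; a 2-layer MLP is one common choice.
        self.projector = nn.Sequential(
            nn.Linear(vis_dim, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim)
        )

    def forward(self, images=None, videos=None):
        tokens = []
        if images is not None:                        # (B, 3, H, W)
            tokens.append(self.image_encoder(images))     # (B, N_img, vis_dim)
        if videos is not None:                        # (B, T, 3, H, W)
            tokens.append(self.video_encoder(videos))     # (B, N_vid, vis_dim)
        visual = torch.cat(tokens, dim=1)             # mixed image/video tokens
        return self.projector(visual)                 # ready to prepend to LLM inputs
```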

If this is right

  • A single model trained on mixed image-video data outperforms models built specifically for images on nine image benchmarks.
  • The same model outperforms Video-ChatGPT by 5.8 to 18.6 percent on four standard video datasets.
  • Images and videos improve each other's performance when processed inside one unified representation.
  • A straightforward alignment step before projection is sufficient to create a working unified LVLM baseline.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same pre-projection alignment idea could be tested with additional modalities such as audio or depth maps.
  • If alignment before projection is the decisive factor, then future work could reduce emphasis on ever-more-complex projection layers.
  • Scaling the mixed dataset size while keeping the unified representation fixed would test whether the mutual-benefit effect grows or saturates.

Load-bearing premise

The main difficulty for an LLM with multi-modal inputs is the absence of unified tokenization for images and videos before the projection layers are applied.

What would settle it

Train a non-unified model that still uses separate image and video encoders but receives the same mixed dataset and check whether it matches or exceeds Video-LLaVA on both image and video benchmarks.
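A hedged sketch of how that control could be organized: two runs sharing the mixed dataset, LLM backbone, and training schedule, differing only in whether the visual front-end is unified before projection. Every field name and value below is illustrative, not drawn from the paper.

```python
# Hypothetical two-arm comparison: identical data and schedule, different front-ends.
control_runs = {
    "unified": dict(
        image_encoder="language-aligned",   # shared feature space before projection
        video_encoder="language-aligned",
        projector="shared",
    ),
    "separate": dict(
        image_encoder="image-specific",     # independent feature spaces
        video_encoder="video-specific",
        projector="per-modality",
    ),
}
shared = dict(train_data="mixed image+video", llm="same backbone", schedule="identical")
for name, cfg in control_runs.items():
    print(name, {**shared, **cfg})   # then compare image and video benchmark scores
```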

Original abstract

The Large Vision-Language Model (LVLM) has enhanced the performance of various downstream tasks in visual-language understanding. Most existing approaches encode images and videos into separate feature spaces, which are then fed as inputs to large language models. However, due to the lack of unified tokenization for images and videos, namely misalignment before projection, it becomes challenging for a Large Language Model (LLM) to learn multi-modal interactions from several poor projection layers. In this work, we unify visual representation into the language feature space to advance the foundational LLM towards a unified LVLM. As a result, we establish a simple but robust LVLM baseline, Video-LLaVA, which learns from a mixed dataset of images and videos, mutually enhancing each other. Video-LLaVA achieves superior performances on a broad range of 9 image benchmarks across 5 image question-answering datasets and 4 image benchmark toolkits. Additionally, our Video-LLaVA also outperforms Video-ChatGPT by 5.8%, 9.9%, 18.6%, and 10.1% on MSRVTT, MSVD, TGIF, and ActivityNet, respectively. Notably, extensive experiments demonstrate that Video-LLaVA mutually benefits images and videos within a unified visual representation, outperforming models designed specifically for images or videos. We aim for this work to provide modest insights into the multi-modal inputs for the LLM. Code address: https://github.com/PKU-YuanGroup/Video-LLaVA

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes Video-LLaVA, an LVLM that aligns image and video features into a shared language feature space prior to projection into the LLM. This unified representation enables joint training on mixed image-video datasets, yielding mutual performance gains. The model reports state-of-the-art results on 9 image benchmarks (across 5 QA datasets and 4 toolkits) and outperforms Video-ChatGPT by 5.8–18.6% on four video datasets (MSRVTT, MSVD, TGIF, ActivityNet).

Significance. If the empirical link between pre-projection alignment and the observed mutual enhancement holds, the work supplies a simple, reproducible baseline for unified LVLMs. The public code release strengthens the contribution by enabling direct verification of the mixed-training protocol and benchmark numbers.

major comments (3)
  1. [§3] §3 (Method): The alignment-before-projection step is described at a high level, but the manuscript does not specify whether the alignment loss is applied to frozen or jointly optimized encoders, nor the exact form of the alignment objective (contrastive, reconstruction, etc.). Without this, the causal contribution of the alignment step to the reported gains cannot be isolated from the mixed-dataset training itself.
  2. [§4.2] §4.2 (Ablation studies): No ablation table isolates the effect of pre-projection alignment versus post-projection fusion or separate image/video projectors. The central claim that alignment enables mutual enhancement therefore rests on the headline benchmark numbers alone rather than controlled comparisons.
  3. [Table 2] Table 2 (video results): The 5.8–18.6% gains over Video-ChatGPT are reported without standard deviations or multiple-run statistics; given that Video-ChatGPT itself uses a different projector and training schedule, it is unclear whether the margin is attributable to the unified representation or to other hyper-parameter differences.
minor comments (2)
  1. [Abstract] The abstract states '9 image benchmarks across 5 image question-answering datasets and 4 image benchmark toolkits'; the exact mapping between these counts and the tables in §4.1 should be clarified for reproducibility.
  2. [§3.1] Notation for the unified visual token space (e.g., the symbol used for the aligned feature before the LLM projector) is introduced inconsistently between §3.1 and Figure 2.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments and the recommendation for minor revision. We address each major point below and will update the manuscript accordingly.

Point-by-point responses
  1. Referee: [§3] §3 (Method): The alignment-before-projection step is described at a high level, but the manuscript does not specify whether the alignment loss is applied to frozen or jointly optimized encoders, nor the exact form of the alignment objective (contrastive, reconstruction, etc.). Without this, the causal contribution of the alignment step to the reported gains cannot be isolated from the mixed-dataset training itself.

    Authors: We thank the referee for highlighting this omission. The alignment is performed with a contrastive loss between the visual features and language embeddings while jointly optimizing the encoders; the encoders are not frozen. We will revise Section 3 to include the precise loss formulation, optimization schedule, and training details so that the contribution of the alignment step can be more clearly isolated; an illustrative sketch of such a contrastive objective follows these responses. revision: yes

  2. Referee: [§4.2] §4.2 (Ablation studies): No ablation table isolates the effect of pre-projection alignment versus post-projection fusion or separate image/video projectors. The central claim that alignment enables mutual enhancement therefore rests on the headline benchmark numbers alone rather than controlled comparisons.

    Authors: We agree that a controlled ablation would strengthen the central claim. In the revised manuscript we will add an ablation study in Section 4.2 that directly compares the pre-projection unified alignment against (i) post-projection fusion and (ii) separate image/video projectors while keeping all other factors fixed. revision: yes

  3. Referee: [Table 2] Table 2 (video results): The 5.8–18.6% gains over Video-ChatGPT are reported without standard deviations or multiple-run statistics; given that Video-ChatGPT itself uses a different projector and training schedule, it is unclear whether the margin is attributable to the unified representation or to other hyper-parameter differences.

    Authors: We acknowledge that variance statistics would be preferable. Due to the high computational cost of LVLM training we report single-run results, which is standard practice in the field. We will add a clarifying note in the revised paper stating this limitation and pointing out that the observed gains are consistent across four distinct video benchmarks and are accompanied by mutual improvements on image tasks, supporting attribution to the unified representation rather than hyper-parameter differences alone. revision: partial
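As referenced in response 1, a minimal sketch of a symmetric contrastive (InfoNCE) alignment objective between pooled visual features and language embeddings. The pooling, temperature, and batching are assumptions for illustration, not details confirmed by the paper.

```python
# Illustrative symmetric InfoNCE alignment loss between matched visual-text pairs.
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(visual_feats, text_feats, temperature=0.07):
    # visual_feats, text_feats: (B, D) pooled features for B matched pairs
    v = F.normalize(visual_feats, dim=-1)
    t = F.normalize(text_feats, dim=-1)
    logits = v @ t.T / temperature                 # (B, B) similarity matrix
    targets = torch.arange(v.size(0), device=v.device)
    # Symmetric cross-entropy over visual-to-text and text-to-visual directions.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))
```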

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the empirical effectiveness of early alignment rather than on new theoretical axioms or invented entities.

free parameters (1)
  • projection and training hyperparameters
    Standard deep-learning hyperparameters required to train the model; not enumerated in the abstract.
axioms (1)
  • domain assumption: Transformer-based LLMs can integrate aligned visual tokens effectively
    Background assumption inherited from prior LVLM work.

pith-pipeline@v0.9.0 · 5584 in / 1101 out tokens · 57514 ms · 2026-05-14T18:00:44.719539+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Forward citations

Cited by 23 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. EvoGround: Self-Evolving Video Agents for Video Temporal Grounding

    cs.CV 2026-05 unverdicted novelty 7.0

    A proposer-solver agent pair achieves supervised-level video temporal grounding and fine-grained captioning from 2.5K unlabeled videos via self-reinforcing evolution.

  2. EyeCue: Driver Cognitive Distraction Detection via Gaze-Empowered Egocentric Video Understanding

    cs.CV 2026-05 unverdicted novelty 7.0

    EyeCue detects driver cognitive distraction by modeling gaze-visual context interactions in egocentric videos and achieves 74.38% accuracy on the new CogDrive dataset, outperforming 11 baselines.

  3. LearnPruner: Rethinking Attention-based Token Pruning in Vision Language Models

    cs.CV 2026-04 unverdicted novelty 7.0

    LearnPruner prunes vision tokens to 5.5% of the original count while retaining about 95% of VLM performance and delivering 3.2 times faster inference by fixing attention sink in encoders and using unbiased middle-laye...

  4. Grounding Video Reasoning in Physical Signals

    cs.CV 2026-04 unverdicted novelty 7.0

    A new benchmark converts video clips into shared grounded event records and tests models across physics, semantic, and control prompts under original, shuffled, ablated, and masked conditions, finding selective robust...

  5. Multi-modal Reasoning with LLMs for Visual Semantic Arithmetic

    cs.AI 2026-04 unverdicted novelty 7.0

    SAri-RFT applies GRPO-based reinforcement fine-tuning to LVLMs on novel two-term and three-term visual semantic arithmetic tasks, reaching SOTA on the new IRPD dataset and Visual7W-Telling.

  6. SVAgent: Storyline-Guided Long Video Understanding via Cross-Modal Multi-Agent Collaboration

    cs.CV 2026-04 unverdicted novelty 7.0

    SVAgent improves long video question answering by constructing storylines via multi-agent collaboration and aligning cross-modal predictions for more robust, human-like reasoning.

  7. Seeing the Scene Matters: Revealing Forgetting in Video Understanding Models with a Scene-Aware Long-Video Benchmark

    cs.CV 2026-03 unverdicted novelty 7.0

    SceneBench shows VLMs lose accuracy on scene-level questions in long videos due to forgetting, and Scene-RAG retrieval improves performance by 2.5%.

  8. MLVU: Benchmarking Multi-task Long Video Understanding

    cs.CV 2024-06 conditional novelty 7.0

    MLVU is a new benchmark for long video understanding that uses extended videos across diverse genres and multi-task evaluations, revealing that current MLLMs struggle significantly and degrade sharply with longer durations.

  9. SpaceMind++: Toward Allocentric Cognitive Maps for Spatially Grounded Video MLLMs

    cs.CV 2026-05 unverdicted novelty 6.0

    SpaceMind++ adds an explicit voxelized allocentric cognitive map and coordinate-guided fusion to video MLLMs, claiming SOTA on VSI-Bench and improved out-of-distribution generalization on three other 3D benchmarks.

  10. WindowQuant: Mixed-Precision KV Cache Quantization based on Window-Level Similarity for VLMs Inference Optimization

    cs.CV 2026-05 unverdicted novelty 6.0

    WindowQuant performs window-adaptive mixed-precision KV cache quantization guided by similarity to the text prompt, with reordering to enable efficient inference in VLMs.

  11. See Further, Think Deeper: Advancing VLM's Reasoning Ability with Low-level Visual Cues and Reflection

    cs.CV 2026-04 unverdicted novelty 6.0

    ForeSight lets VLMs use low-level visual cues and mask-based visual feedback within an RL loop to reason more accurately, with the 7B model beating same-scale peers and some closed-source SOTA on a new benchmark.

  12. UniCon: Unified Framework for Efficient Contrastive Alignment via Kernels

    cs.LG 2026-04 unverdicted novelty 6.0

    UniCon unifies contrastive alignment across encoders and alignment types using kernels to enable exact closed-form updates instead of stochastic optimization.

  13. One Token per Highly Selective Frame: Towards Extreme Compression for Long Video Understanding

    cs.CV 2026-04 unverdicted novelty 6.0

    XComp reaches extreme video compression (one token per selective frame) via learnable progressive token compression and question-conditioned frame selection, lifting LVBench accuracy from 42.9 percent to 46.2 percent ...

  14. ViLL-E: Video LLM Embeddings for Retrieval

    cs.CV 2026-04 unverdicted novelty 6.0

    ViLL-E introduces a dynamic embedding mechanism and joint contrastive-generative training for VideoLLMs, delivering up to 7% gains in temporal localization and 4% in video retrieval while enabling new zero-shot capabilities.

  15. CFMS: A Coarse-to-Fine Multimodal Synthesis Framework for Enhanced Tabular Reasoning

    cs.AI 2026-04 unverdicted novelty 6.0

    CFMS is a coarse-to-fine framework that uses MLLMs to create a multi-perspective knowledge tuple as a reasoning map for symbolic table operations, yielding competitive accuracy on WikiTQ and TabFact.

  16. Spatio-Temporal Grounding of Large Language Models from Perception Streams

    cs.RO 2026-04 unverdicted novelty 6.0

    FESTS uses Spatial Regular Expressions compiled from queries to generate 27k training tuples that raise a 3B-parameter LLM's frame-level F1 on spatio-temporal video reasoning from 48.5% to 87.5%, matching GPT-4.1 whil...

  17. Progressive Video Condensation with MLLM Agent for Long-form Video Understanding

    cs.CV 2026-04 unverdicted novelty 6.0

    ProVCA progressively condenses long videos via segment localization, snippet selection, and keyframe refinement to achieve SOTA zero-shot accuracies on EgoSchema, NExT-QA, and IntentQA with fewer frames.

  18. ImgEdit: A Unified Image Editing Dataset and Benchmark

    cs.CV 2025-05 conditional novelty 6.0

    ImgEdit supplies 1.2 million curated edit pairs and a three-part benchmark that let a VLM-based model outperform prior open-source editors on adherence, quality, and detail preservation.

  19. UniWorld-V1: High-Resolution Semantic Encoders for Unified Visual Understanding and Generation

    cs.CV 2025-06 unverdicted novelty 5.0

    UniWorld-V1 shows that semantic features from large multimodal models enable unified visual understanding and generation, achieving strong results on perception and manipulation tasks with only 2.7 million training samples.

  20. LLaVA-OneVision: Easy Visual Task Transfer

    cs.CV 2024-08 unverdicted novelty 5.0

    LLaVA-OneVision is the first single open LMM to simultaneously achieve strong performance in single-image, multi-image, and video scenarios with cross-scenario transfer capabilities.

  21. VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding

    cs.CV 2025-01 unverdicted novelty 4.0

    VideoLLaMA3 uses a vision-centric training paradigm and token-reduction design to reach competitive results on image and video benchmarks.

  22. VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs

    cs.CV 2024-06 unverdicted novelty 4.0

    VideoLLaMA 2 improves video LLMs via a new STC connector for spatial-temporal dynamics and joint audio training, reaching competitive results on video QA and captioning benchmarks.

  23. A Survey on Hallucination in Large Vision-Language Models

    cs.CV 2024-02 unverdicted novelty 3.0

    This survey reviews the definition, symptoms, evaluation benchmarks, root causes, and mitigation methods for hallucinations in large vision-language models.

Reference graph

Works this paper leans on

87 extracted references · 87 canonical work pages · cited by 23 Pith papers · 21 internal anchors

  1. [1]

    Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. 2022. Flamingo: a visual language model for few-shot learning. Advances in Neural Information Processing Systems, 35:23716--23736

  2. [3]

    Max Bain, Arsha Nagrani, Gül Varol, and Andrew Zisserman. 2021. Frozen in time: A joint video and image encoder for end-to-end retrieval. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1728--1738

  3. [5]

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in neural information processing systems, 33:1877--1901

  4. [6]

    David Chen and William B Dolan. 2011. Collecting highly parallel data for paraphrase evaluation. In Proceedings of the 49th annual meeting of the association for computational linguistics: human language technologies, pages 190--200

  5. [8]

    Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E Gonzalez, et al. 2023. Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality. See https://vicuna.lmsys.org (accessed 14 April 2023)

  6. [9]

    Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven Hoi. 2023. InstructBLIP: Towards general-purpose vision-language models with instruction tuning. Preprint, arXiv:2305.06500

  7. [12]

    Rohit Girdhar, Alaaeldin El-Nouby, Zhuang Liu, Mannat Singh, Kalyan Vasudev Alwala, Armand Joulin, and Ishan Misra. 2023. Imagebind: One embedding space to bind them all. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15180--15190

  8. [14]

    Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. 2017. Making the v in vqa matter: Elevating the role of image understanding in visual question answering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 6904--6913

  9. [15]

    Danna Gurari, Qing Li, Abigale J Stangl, Anhong Guo, Chi Lin, Kristen Grauman, Jiebo Luo, and Jeffrey P Bigham. 2018. Vizwiz grand challenge: Answering visual questions from blind people. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3608--3617

  10. [17]

    Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. 2022. Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 16000--16009

  11. [19]

    Drew A Hudson and Christopher D Manning. 2019. Gqa: A new dataset for real-world visual reasoning and compositional question answering. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6700--6709

  12. [21]

    Yunseok Jang, Yale Song, Youngjae Yu, Youngjin Kim, and Gunhee Kim. 2017. Tgif-qa: Toward spatio-temporal reasoning in visual question answering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2758--2766

  13. [23]

    Wonjae Kim, Bokyung Son, and Ildoo Kim. 2021. Vilt: Vision-and-language transformer without convolution or region supervision. In International Conference on Machine Learning, pages 5583--5594. PMLR

  14. [24]

    Hugo Laurençon, Lucile Saulnier, Léo Tronchon, Stas Bekman, Amanpreet Singh, Anton Lozhkov, Thomas Wang, Siddharth Karamcheti, Alexander M. Rush, Douwe Kiela, Matthieu Cord, and Victor Sanh. 2023. OBELICS: An open web-scale filtered dataset of interleaved image-text documents. Preprint, arXiv:2306.16527

  15. [27]

    Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. 2022. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In International Conference on Machine Learning, pages 12888--12900. PMLR

  16. [28]

    Junnan Li, Ramprasaath Selvaraju, Akhilesh Gotmare, Shafiq Joty, Caiming Xiong, and Steven Chu Hong Hoi. 2021. Align before fuse: Vision and language representation learning with momentum distillation. Advances in neural information processing systems, 34:9694--9705

  17. [35]

    Pan Lu, Swaroop Mishra, Tanglin Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. 2022. Learn to explain: Multimodal reasoning via thought chains for science question answering. Advances in Neural Information Processing Systems, 35:2507--2521

  18. [39]

    OpenAI. 2023. GPT-4 technical report. Preprint, arXiv:2303.08774

  19. [40]

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. 2022. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730--27744

  20. [42]

    Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. 2018. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2556--2565

  21. [44]

    Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, and Marcus Rohrbach. 2019. Towards vqa models that can read. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8317--8326

  22. [46]

    Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B Hashimoto. 2023. Stanford alpaca: An instruction-following llama model

  23. [50]

    Jun Xu, Tao Mei, Ting Yao, and Yong Rui. 2016. Msr-vtt: A large video description dataset for bridging video and language. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5288--5296

  24. [55]

    Zhou Yu, Dejing Xu, Jun Yu, Ting Yu, Zhou Zhao, Yueting Zhuang, and Dacheng Tao. 2019. Activitynet-qa: A dataset for understanding complex web videos via question answering. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 9127--9134

  25. [60]

    Improved Baselines with Visual Instruction Tuning. arXiv preprint arXiv:2310.03744

  26. [61]

    Visual Instruction Tuning. arXiv preprint arXiv:2304.08485

  27. [62]

    Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models. arXiv preprint arXiv:2306.05424

  28. [63]

    VideoChat: Chat-Centric Video Understanding. arXiv preprint arXiv:2305.06355

  29. [64]

    Valley: Video Assistant with Large Language model Enhanced abilitY. arXiv preprint arXiv:2306.07207

  30. [65]

    Frozen in time: A joint video and image encoder for end-to-end retrieval. In Proceedings of the IEEE/CVF International Conference on Computer Vision

  31. [66]

    GPT-4 Technical Report. 2023

  32. [67]

    Stanford Alpaca: An instruction-following LLaMA model

  33. [68]

    LLaMA: Open and Efficient Foundation Language Models. arXiv preprint arXiv:2302.13971

  34. [69]

    Llama 2: Open Foundation and Fine-Tuned Chat Models. arXiv preprint arXiv:2307.09288

  35. [70]

    Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality. See https://vicuna.lmsys.org (accessed 14 April 2023)

  36. [71]

    Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems

  37. [72]

    Language models are few-shot learners. Advances in Neural Information Processing Systems

  38. [73]

    Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models. arXiv preprint arXiv:2303.04671

  39. [74]

    HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face. arXiv preprint arXiv:2303.17580

  40. [75]

    MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action. arXiv preprint arXiv:2303.11381

  41. [76]

    ViperGPT: Visual inference via Python execution for reasoning. arXiv preprint arXiv:2303.08128

  42. [77]

    InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning. 2023

  43. [78]

    MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models. arXiv preprint arXiv:2304.10592

  44. [79]

    mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality. arXiv preprint arXiv:2304.14178

  45. [80]

    Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding. arXiv preprint arXiv:2306.02858

  46. [81]

    LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention. arXiv preprint arXiv:2303.16199

  47. [82]

    LLaMA-Adapter V2: Parameter-efficient visual instruction model. arXiv preprint arXiv:2304.15010

  48. [83]

    ImageBind-LLM: Multi-modality instruction tuning. arXiv preprint arXiv:2309.03905

  49. [84]

    ImageBind: One embedding space to bind them all. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

  50. [85]

    Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

  51. [86]

    BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In International Conference on Machine Learning, 2022

  52. [87]

    Making the V in VQA matter: Elevating the role of image understanding in visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition

  53. [88]

    GQA: A new dataset for real-world visual reasoning and compositional question answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

  54. [89]

    VizWiz grand challenge: Answering visual questions from blind people. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition

  55. [90]

    Learn to explain: Multimodal reasoning via thought chains for science question answering. Advances in Neural Information Processing Systems

  56. [91]

    Towards VQA models that can read. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

  57. [92]

    Evaluating Object Hallucination in Large Vision-Language Models. arXiv preprint arXiv:2305.10355

  58. [93]

    MMBench: Is Your Multi-modal Model an All-around Player? arXiv preprint arXiv:2307.06281

  59. [94]

    MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities. arXiv preprint arXiv:2308.02490

  60. [95]

    MSR-VTT: A large video description dataset for bridging video and language. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition

  61. [96]

    Collecting highly parallel data for paraphrase evaluation. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies

  62. [97]

    TGIF-QA: Toward spatio-temporal reasoning in visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition

  63. [98]

    ActivityNet-QA: A dataset for understanding complex web videos via question answering. In Proceedings of the AAAI Conference on Artificial Intelligence

  64. [99]

    PALM: Pre-training an autoencoding & autoregressive language model for context-conditioned generation. arXiv preprint arXiv:2004.07159

  65. [100]

    PaLM 2 Technical Report. arXiv preprint arXiv:2305.10403

  66. [101]

    BLOOM: A 176B-Parameter Open-Access Multilingual Language Model. arXiv preprint arXiv:2211.05100

  67. [102]

    Flamingo: a visual language model for few-shot learning. Advances in Neural Information Processing Systems

  68. [103]

    Align before fuse: Vision and language representation learning with momentum distillation. Advances in Neural Information Processing Systems

  69. [104]

    ViLT: Vision-and-language transformer without convolution or region supervision. In International Conference on Machine Learning, 2021

  70. [105]

    BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models. arXiv preprint arXiv:2301.12597

  71. [106]

    Otter: A multi-modal model with in-context instruction tuning. arXiv preprint arXiv:2305.03726

  72. [107]

    LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment. arXiv preprint arXiv:2310.01852

  73. [108]

    MultiModal-GPT: A vision and language model for dialogue with humans. arXiv preprint arXiv:2305.04790

  74. [109]

    Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding. arXiv preprint arXiv:2311.08046

  75. [110]

    OBELICS: An Open Web-Scale Filtered Dataset of Interleaved Image-Text Documents. 2023

  76. [111]

    Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

  77. [112]

    VQA: Visual question answering. In Proceedings of the IEEE International Conference on Computer Vision

  78. [113]

    X-LLM: Bootstrapping advanced large language models by treating multi-modalities as foreign languages. arXiv preprint arXiv:2305.04160

  79. [114]

    Macaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and Text Integration. arXiv preprint arXiv:2306.09093

  80. [115]

    Gabriel Ilharco, Mitchell Wortsman, Ross Wightman, Cade Gordon, Nicholas Carlini, Rohan Taori, Achal Dave, Vaishaal Shankar, Hongseok Namkoong, John Miller, Hannaneh Hajishirzi, Ali Farhadi, and Ludwig Schmidt. doi:10.5281/zenodo.5143773

Showing first 80 references.