pith. machine review for the scientific record.

arxiv: 2305.06355 · v2 · submitted 2023-05-10 · 💻 cs.CV · cs.CL


VideoChat: Chat-Centric Video Understanding

Kunchang Li, Limin Wang, Ping Luo, Wenhai Wang, Yali Wang, Yinan He, Yi Wang, Yizhuo Li, Yu Qiao

Pith reviewed 2026-05-13 23:25 UTC · model grok-4.3

classification 💻 cs.CV cs.CL
keywords: video understanding · large language models · multimodal models · instruction tuning · spatiotemporal reasoning · chat-centric AI

The pith

VideoChat links video models to language models with a trainable interface for conversational video analysis.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents VideoChat as an end-to-end system that joins video foundation models with large language models through a learnable neural interface. This design targets chat-style interactions with video, particularly for reasoning about time and space, locating events, and tracing causal links. The authors also release a dataset of thousands of videos paired with detailed descriptions and dialogues to train the system on these skills. If the interface works as intended, the result is a prototype that lets users discuss video content in natural language rather than through separate analysis tools.

Core claim

It integrates video foundation models and large language models via a learnable neural interface, excelling in spatiotemporal reasoning, event localization, and causal relationship inference.

What carries the argument

The learnable neural interface that aligns outputs from the video foundation model with inputs to the large language model for joint training on video instructions.
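The excerpt does not specify the interface internals, so as a minimal sketch: one common design is a trainable projection that maps frozen video-encoder features into the language model's embedding space. The dimensions, class name, and linear form below are illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions, chosen for illustration only.
VIDEO_DIM = 768   # output width of a frozen video foundation model
LLM_DIM = 4096    # input embedding width of a frozen large language model

class LinearInterface:
    """Trainable projection aligning frozen video features with the LLM's
    token-embedding space. During instruction tuning, only W and b would be
    updated; both backbone models stay frozen."""

    def __init__(self, in_dim, out_dim):
        self.W = rng.normal(0, 0.02, size=(in_dim, out_dim))
        self.b = np.zeros(out_dim)

    def __call__(self, video_feats):
        # video_feats: (num_video_tokens, in_dim) -> (num_video_tokens, out_dim)
        return video_feats @ self.W + self.b

interface = LinearInterface(VIDEO_DIM, LLM_DIM)
video_tokens = rng.normal(size=(16, VIDEO_DIM))  # e.g., 16 pooled clip features
llm_ready = interface(video_tokens)
print(llm_ready.shape)  # projected tokens, ready to prepend to text embeddings
```

In practice such interfaces are often heavier than a single linear layer (e.g., cross-attention query modules), but the alignment role is the same: translate visual tokens into inputs the language model can attend over.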

If this is right

  • Natural-language questions about what happens at specific times in a video become answerable in one system.
  • The same model can localize events and explain causal chains without separate modules for each task.
  • The released video-instruction dataset can be reused to train other multimodal chat systems.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Scaling the interface to longer or streaming videos would test whether the current training setup generalizes beyond the dataset's clip lengths.
  • Adding audio tracks to the input could strengthen causal inferences that currently rely only on visual cues.
  • The open-source release invites direct comparisons with future models on the same instruction set to measure progress.

Load-bearing premise

Training the neural interface on the video-centric instruction dataset is sufficient to bridge the two models well enough for strong performance on spatiotemporal and causal tasks.

What would settle it

Quantitative tests on held-out videos would settle it: if VideoChat answers questions about event timing, location, and cause no more accurately than a frozen video captioner paired with an off-the-shelf language model, the claimed integration benefit is disproved.
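One way to operationalize that comparison, assuming hypothetical prediction lists from each system on the same held-out question set (exact-match accuracy stands in for whatever benchmark metric a real evaluation would use):

```python
def accuracy(predictions, answers):
    """Fraction of exact-match answers; a stand-in for a real QA metric."""
    assert len(predictions) == len(answers)
    return sum(p == a for p, a in zip(predictions, answers)) / len(answers)

# Hypothetical held-out answers and outputs from the two systems.
gold      = ["opens door", "kitchen", "spilled water"]
videochat = ["opens door", "kitchen", "knocked over glass"]
pipeline  = ["opens door", "hallway", "unknown"]  # captioner + off-the-shelf LLM

integration_benefit = accuracy(videochat, gold) - accuracy(pipeline, gold)
print(f"benefit: {integration_benefit:+.2f}")
```

A benefit at or below zero on a sufficiently large held-out set, across timing, location, and causal questions, would undercut the integration claim.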

read the original abstract

In this paper, we initiate an attempt of developing an end-to-end chat-centric video understanding system, coined as VideoChat. It integrates video foundation models and large language models via a learnable neural interface, excelling in spatiotemporal reasoning, event localization, and causal relationship inference. To instructively tune this system, we build a video-centric instruction dataset, composed of thousands of videos associated with detailed descriptions and conversations. This dataset emphasizes spatiotemporal reasoning and captures causal relationships, providing a valuable asset for training our chat-centric video understanding system. Preliminary qualitative experiments demonstrate the potential of our system across a broad spectrum of video applications, which could serve as a simple prototype system for future research on chat-centric video understanding. Access our code and data at https://github.com/OpenGVLab/Ask-Anything

Editorial analysis

A structured set of objections, weighed in public.

A referee report, simulated authors' rebuttal, circularity audit, and axiom ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces VideoChat, an end-to-end chat-centric video understanding system that integrates video foundation models with large language models via a learnable neural interface. A new video-centric instruction dataset is constructed to emphasize spatiotemporal reasoning and causal relationships, and the system is tuned on this data. The central claim is that the resulting model excels at spatiotemporal reasoning, event localization, and causal inference, with preliminary qualitative experiments presented as evidence of its potential as a prototype for future work.

Significance. If the bridging effectiveness of the learnable interface were quantitatively validated, the work would supply a useful open dataset and modular architecture for multimodal video chat systems. The absence of metrics, ablations, or baselines, however, leaves the performance claims unverified and reduces the immediate contribution to a preliminary demonstration rather than a substantiated advance.

major comments (3)
  1. [Abstract, §4] Abstract and §4 (Experiments): the assertion that the system 'excels' in spatiotemporal reasoning, event localization, and causal relationship inference rests exclusively on preliminary qualitative examples. No quantitative metrics (e.g., accuracy, mAP, or causal QA scores), baselines (e.g., existing video-LLMs), or error analysis are reported, rendering the central performance claim unsupported.
  2. [§3] §3 (Method): the learnable neural interface is presented as the key bridge between video foundation models and the LLM, yet no ablation isolating its contribution (e.g., frozen vs. trained interface) or analysis of how it encodes temporal or causal structure is provided. This leaves the mechanism by which the claimed capabilities arise unexamined.
  3. [§2] §2 (Dataset): while the video-centric instruction dataset is described as capturing causal relationships, no statistics on the distribution of causal vs. descriptive queries, inter-annotator agreement, or verification that the data actually elicits the targeted reasoning are given. This weakens the justification for using the dataset to train the claimed capabilities.
minor comments (2)
  1. [Abstract] The GitHub link is provided but the manuscript does not specify which components (interface weights, dataset splits, evaluation scripts) are released, making reproducibility harder to assess.
  2. [§3] Notation for the neural interface (e.g., input/output dimensions, training objective) is introduced without an accompanying equation or diagram in §3, which would clarify the architecture.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We agree that the current version of the manuscript is preliminary in nature and that the performance claims would benefit from quantitative support, ablations, and dataset statistics. We will revise the paper accordingly to address these points while preserving its positioning as an initial prototype for chat-centric video understanding. Our point-by-point responses follow.

read point-by-point responses
  1. Referee: [Abstract, §4] Abstract and §4 (Experiments): the assertion that the system 'excels' in spatiotemporal reasoning, event localization, and causal relationship inference rests exclusively on preliminary qualitative examples. No quantitative metrics (e.g., accuracy, mAP, or causal QA scores), baselines (e.g., existing video-LLMs), or error analysis are reported, rendering the central performance claim unsupported.

    Authors: We agree that the central claims rest on qualitative examples alone and that this leaves the performance assertions unsupported by quantitative evidence. In the revision we will replace the word 'excels' with more measured language that reflects the preliminary nature of the work. We will also add quantitative results on standard video QA and temporal grounding benchmarks, include direct comparisons to existing video-LLM baselines, and provide a basic error analysis of failure cases. revision: yes

  2. Referee: [§3] §3 (Method): the learnable neural interface is presented as the key bridge between video foundation models and the LLM, yet no ablation isolating its contribution (e.g., frozen vs. trained interface) or analysis of how it encodes temporal or causal structure is provided. This leaves the mechanism by which the claimed capabilities arise unexamined.

    Authors: We concur that the absence of ablations and mechanistic analysis leaves the role of the learnable interface insufficiently examined. In the revised manuscript we will report ablations that compare the trained interface against a frozen counterpart and will include visualizations (e.g., attention maps over video tokens) together with probing experiments to illustrate how temporal and causal information is represented. revision: yes

  3. Referee: [§2] §2 (Dataset): while the video-centric instruction dataset is described as capturing causal relationships, no statistics on the distribution of causal vs. descriptive queries, inter-annotator agreement, or verification that the data actually elicits the targeted reasoning are given. This weakens the justification for using the dataset to train the claimed capabilities.

    Authors: We acknowledge that additional dataset documentation is required. The revised version will include quantitative statistics on the proportion of causal, spatiotemporal, and descriptive queries, inter-annotator agreement figures for the collected conversations, and a description of the annotation protocol and quality-control steps used to ensure the data targets the intended reasoning skills. revision: yes
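The frozen-vs-trained interface ablation promised in response 2 amounts to a switch in the update step: the frozen arm keeps the interface at initialization while everything else is held equal. A minimal sketch, with hypothetical weights and gradients standing in for the real training loop:

```python
import numpy as np

rng = np.random.default_rng(1)

def sgd_step(weights, grads, lr, trainable):
    """One gradient step; the frozen ablation arm skips the update entirely."""
    if not trainable:
        return weights.copy()       # frozen arm: interface stays at init
    return weights - lr * grads

# Stand-ins for interface weights and a batch gradient.
W_init = rng.normal(size=(4, 4))
grads = np.ones((4, 4))

W_trained = sgd_step(W_init, grads, lr=0.1, trainable=True)
W_frozen = sgd_step(W_init, grads, lr=0.1, trainable=False)

# The comparison the referee asks for: run both arms to the same training
# budget, then evaluate each on the same spatiotemporal/causal benchmark split.
print(np.allclose(W_frozen, W_init), np.allclose(W_trained, W_init))
```

Any accuracy gap between the two arms then isolates what the learnable interface itself contributes, which is exactly the evidence the referee says is missing.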

Circularity Check

0 steps flagged

No significant circularity: empirical system assembly with external validation

full rationale

The paper describes an engineering integration of pre-existing video foundation models and LLMs via a new learnable neural interface, trained on a custom video-centric instruction dataset. Claims of performance in spatiotemporal reasoning etc. rest on preliminary qualitative experiments rather than any derivation, equation, or prediction. No self-definitional steps, fitted inputs renamed as predictions, or load-bearing self-citations appear in the provided text. The central result is an assembled prototype whose effectiveness is asserted via external demonstration, not closed-loop reduction to its own inputs.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim depends on the effectiveness of the neural interface trained on the new dataset and on the assumption that prior video foundation models supply adequate features; no new physical entities are introduced.

free parameters (1)
  • learnable neural interface parameters
    Weights of the interface are fitted during training on the instruction dataset to align video features with the language model.
axioms (1)
  • domain assumption: pre-trained video foundation models extract features sufficient for spatiotemporal and causal reasoning when aligned via the interface.
    Invoked when claiming that the integration yields strong performance without re-deriving video feature extraction.

pith-pipeline@v0.9.0 · 5450 in / 1366 out tokens · 55715 ms · 2026-05-13T23:25:41.740675+00:00 · methodology

discussion (0)


Forward citations

Cited by 24 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. EvoGround: Self-Evolving Video Agents for Video Temporal Grounding

    cs.CV 2026-05 unverdicted novelty 7.0

    A proposer-solver agent pair achieves supervised-level video temporal grounding and fine-grained captioning from 2.5K unlabeled videos via self-reinforcing evolution.

  2. OZ-TAL: Online Zero-Shot Temporal Action Localization

    cs.CV 2026-05 unverdicted novelty 7.0

    Defines OZ-TAL task and presents a training-free VLM-based method that outperforms prior approaches for online and offline zero-shot temporal action localization on THUMOS14 and ActivityNet-1.3.

  3. IMPACT-CYCLE: A Contract-Based Multi-Agent System for Claim-Level Supervisory Correction of Long-Video Semantic Memory

    cs.CV 2026-04 unverdicted novelty 7.0

    A contract-based multi-agent system maintains a claim-level semantic memory for long videos, enabling targeted corrections that raise VQA accuracy from 0.71 to 0.79 and cut human arbitration cost by 4.8x on VidOR.

  4. OASIS: On-Demand Hierarchical Event Memory for Streaming Video Reasoning

    cs.CV 2026-04 unverdicted novelty 7.0

    OASIS organizes streaming video into hierarchical events and retrieves memory on-demand via intent-driven refinement to improve long-horizon accuracy and compositional reasoning with bounded token costs.

  5. SVAgent: Storyline-Guided Long Video Understanding via Cross-Modal Multi-Agent Collaboration

    cs.CV 2026-04 unverdicted novelty 7.0

    SVAgent improves long video question answering by constructing storylines via multi-agent collaboration and aligning cross-modal predictions for more robust, human-like reasoning.

  6. LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models

    cs.CV 2024-07 unverdicted novelty 7.0

    LLaVA-NeXT-Interleave unifies multi-image, video, and 3D capabilities in large multimodal models via a new 1.18M-sample interleaved dataset and benchmark, achieving leading results across those tasks while preserving ...

  7. MLVU: Benchmarking Multi-task Long Video Understanding

    cs.CV 2024-06 conditional novelty 7.0

    MLVU is a new benchmark for long video understanding that uses extended videos across diverse genres and multi-task evaluations, revealing that current MLLMs struggle significantly and degrade sharply with longer durations.

  8. SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension

    cs.CL 2023-07 unverdicted novelty 7.0

    SEED-Bench is a new benchmark of 19K multiple-choice questions for evaluating generative comprehension in multimodal LLMs across 12 image and video dimensions.

  9. Beyond Perceptual Shortcuts: Causal-Inspired Debiasing Optimization for Generalizable Video Reasoning in Lightweight MLLMs

    cs.CV 2026-05 unverdicted novelty 6.0

    VideoThinker improves lightweight MLLM video reasoning by creating a bias model to capture shortcuts and applying causal debiasing policy optimization to push away from them, achieving SOTA efficiency with minimal data.

  10. Seeing Fast and Slow: Learning the Flow of Time in Videos

    cs.CV 2026-04 unverdicted novelty 6.0

    Self-supervised models learn to perceive and manipulate the flow of time in videos, supporting speed detection, large-scale slow-motion data curation, and temporally controllable video synthesis.

  11. One Token per Highly Selective Frame: Towards Extreme Compression for Long Video Understanding

    cs.CV 2026-04 unverdicted novelty 6.0

    XComp reaches extreme video compression (one token per selective frame) via learnable progressive token compression and question-conditioned frame selection, lifting LVBench accuracy from 42.9 percent to 46.2 percent ...

  12. ViLL-E: Video LLM Embeddings for Retrieval

    cs.CV 2026-04 unverdicted novelty 6.0

    ViLL-E introduces a dynamic embedding mechanism and joint contrastive-generative training for VideoLLMs, delivering up to 7% gains in temporal localization and 4% in video retrieval while enabling new zero-shot capabilities.

  13. POINTS-Long: Adaptive Dual-Mode Visual Reasoning in MLLMs

    cs.CV 2026-04 unverdicted novelty 6.0

    POINTS-Long is a dual-mode multimodal large language model that uses dynamic visual token scaling to retain 97.7-99.7% accuracy on long-form tasks with 1/40 to 1/10 of the tokens and supports streaming via a detachable KV-cache.

  14. InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

    cs.CV 2025-08 unverdicted novelty 6.0

    InternVL3.5 advances open-source multimodal models with Cascade RL for +16% reasoning gains and ViR for 4x inference speedup, with the 241B model reaching SOTA among open-source MLLMs on multimodal, reasoning, and age...

  15. InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

    cs.CV 2025-04 conditional novelty 6.0

    InternVL3-78B sets a new open-source SOTA of 72.2 on MMMU via native joint multimodal pre-training, V2PE, MPO, and test-time scaling while remaining competitive with proprietary models.

  16. Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

    cs.CV 2024-12 unverdicted novelty 6.0

    InternVL 2.5 is the first open-source MLLM to surpass 70% on the MMMU benchmark via model, data, and test-time scaling, with a 3.7-point gain from chain-of-thought reasoning.

  17. LLaVA-Video: Video Instruction Tuning With Synthetic Data

    cs.CV 2024-10 unverdicted novelty 6.0

    LLaVA-Video-178K is a new synthetic video instruction dataset that, when combined with existing data to train LLaVA-Video, produces strong results on video understanding benchmarks.

  18. Video-LLaVA: Learning United Visual Representation by Alignment Before Projection

    cs.CV 2023-11 unverdicted novelty 6.0

    Video-LLaVA creates a unified visual representation for images and videos via pre-projection alignment, enabling mutual enhancement from joint training and strong results on image and video benchmarks.

  19. From Where Things Are to What They Are For: Benchmarking Spatial-Functional Intelligence in Multimodal LLMs

    cs.CV 2026-05 unverdicted novelty 5.0

    SFI-Bench shows current multimodal LLMs struggle to integrate spatial memory with functional reasoning and external knowledge in video tasks.

  20. InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks

    cs.CV 2023-12 unverdicted novelty 5.0

    InternVL scales a vision model to 6B parameters and aligns it with LLMs using web data to achieve state-of-the-art results on 32 visual-linguistic benchmarks.

  21. CurEvo: Curriculum-Guided Self-Evolution for Video Understanding

    cs.CV 2026-04 unverdicted novelty 4.0

    CurEvo integrates curriculum guidance into self-evolution to structure autonomous improvement of video understanding models, yielding gains on VideoQA benchmarks.

  22. Empowering Video Translation using Multimodal Large Language Models

    cs.CV 2026-04 unverdicted novelty 4.0

    The paper offers the first focused review of MLLM-based video translation organized by a three-role taxonomy of Semantic Reasoner, Expressive Performer, and Visual Synthesizer, plus open challenges.

  23. VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding

    cs.CV 2025-01 unverdicted novelty 4.0

    VideoLLaMA3 uses a vision-centric training paradigm and token-reduction design to reach competitive results on image and video benchmarks.

  24. How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites

    cs.CV 2024-04 unverdicted novelty 4.0

    InternVL 1.5 narrows the performance gap to proprietary multimodal models via a stronger transferable vision encoder, dynamic high-resolution tiling, and curated English-Chinese training data.

Reference graph

Works this paper leans on

59 extracted references · 59 canonical work pages · cited by 24 Pith papers · 12 internal anchors

  1. [1]

    Openflamingo, March 2023

    Anas Awadalla, Irena Gao, Joshua Gardner, Jack Hessel, Yusuf Hanafy, Wanrong Zhu, Kalyani Marathe, Yonatan Bitton, Samir Gadre, Jenia Jitsev, Simon Kornblith, Pang Wei Koh, Gabriel Ilharco, Mitchell Wortsman, and Ludwig Schmidt. Openflamingo, March 2023

  2. [2]

    Frozen in time: A joint video and image encoder for end-to-end retrieval

    Max Bain, Arsha Nagrani, Gül Varol, and Andrew Zisserman. Frozen in time: A joint video and image encoder for end-to-end retrieval. In IEEE International Conference on Computer Vision, 2021

  3. [3]

    Language models are few-shot learners

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in Neural Information Processing Systems , 33:1877–1901, 2020

  4. [4]

    Conceptual 12m: Pushing web-scale image-text pre-training to recognize long-tail visual concepts

    Soravit Changpinyo, Piyush Sharma, Nan Ding, and Radu Soricut. Conceptual 12m: Pushing web-scale image-text pre-training to recognize long-tail visual concepts. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3558–3568, 2021

  5. [5]

    Internvideo-ego4d: A pack of champion solutions to ego4d challenges

    Guo Chen, Sen Xing, Zhe Chen, Yi Wang, Kunchang Li, Yizhuo Li, Yi Liu, Jiahao Wang, Yin-Dong Zheng, Bingkun Huang, et al. Internvideo-ego4d: A pack of champion solutions to ego4d challenges. arXiv preprint arXiv:2211.09529, 2022

  6. [6]

    Microsoft COCO Captions: Data Collection and Evaluation Server

    Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325, 2015

  7. [7]

    Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality

    Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E Gonzalez, et al. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality, 2023

  8. [8]

    Scaling Instruction-Finetuned Language Models

    Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Eric Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, et al. Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416, 2022

  9. [9]

    Moss

    MOSS contributors. Moss. https://github.com/OpenLMLab/MOSS, 2023

  10. [10]

    Stablelm: Stability ai language models

    StableLM contributors. Stablelm: Stability ai language models. https://github.com/stability-AI/stableLM, 2023

  11. [11]

    An empirical study of training end-to-end vision-and- language transformers

    Zi-Yi Dou, Yichong Xu, Zhe Gan, Jianfeng Wang, Shuohang Wang, Lijuan Wang, Chenguang Zhu, Pengchuan Zhang, Lu Yuan, Nanyun Peng, et al. An empirical study of training end-to-end vision-and- language transformers. In CVPR, 2022

  12. [12]

    Violet: End- to-end video-language transformers with masked visual-token modeling

    Tsu-Jui Fu, Linjie Li, Zhe Gan, Kevin Lin, William Yang Wang, Lijuan Wang, and Zicheng Liu. Violet: End- to-end video-language transformers with masked visual-token modeling. arXiv preprint arXiv:2111.12681, 2021

  13. [13]

    Scaling up vision-language pre-training for image captioning

    Xiaowei Hu, Zhe Gan, Jianfeng Wang, Zhengyuan Yang, Zicheng Liu, Yumao Lu, and Lijuan Wang. Scaling up vision-language pre-training for image captioning. In CVPR, 2022

  14. [14]

    Language is not all you need: Aligning perception with language models

    Shaohan Huang, Li Dong, Wenhui Wang, Yaru Hao, Saksham Singhal, Shuming Ma, Tengchao Lv, Lei Cui, Owais Khan Mohammed, Qiang Liu, et al. Language is not all you need: Aligning perception with language models. arXiv preprint arXiv:2302.14045, 2023

  15. [15]

    Tag2text: Guiding vision-language model via image tagging

    Xinyu Huang, Youcai Zhang, Jinyu Ma, Weiwei Tian, Rui Feng, Yuejie Zhang, Yaqian Li, Yandong Guo, and Lei Zhang. Tag2text: Guiding vision-language model via image tagging. arXiv preprint arXiv:2303.05657, 2023

  16. [16]

    Dolphin: General video interaction platform based on llms, 2023

    Zehuan Huang, Haoran Feng, Chongzhi Zhang, Lu Sheng, Ziwei Liu, and Jing Shao. Dolphin: General video interaction platform based on llms, 2023. https://github.com/kaleido-lab/dolphin

  17. [17]

    Visual genome: Connecting language and vision using crowdsourced dense image annotations

    Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A Shamma, et al. Visual genome: Connecting language and vision using crowdsourced dense image annotations. International journal of computer vision , 123:32–73, 2017

  18. [18]

    BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models

    Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597, 2023

  19. [19]

    Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation

    Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In International Conference on Machine Learning, pages 12888–12900. PMLR, 2022

  20. [20]

    Uniformerv2: Spatiotemporal learning by arming image vits with video uniformer

    Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Limin Wang, and Yu Qiao. Uniformerv2: Spatiotemporal learning by arming image vits with video uniformer. arXiv preprint arXiv:2211.09552, 2022

  21. [21]

    Unmasked teacher: Towards training-efficient video foundation models

    Kunchang Li, Yali Wang, Yizhuo Li, Yi Wang, Yinan He, Limin Wang, and Yu Qiao. Unmasked teacher: Towards training-efficient video foundation models. arXiv preprint arXiv:2303.16058, 2023

  22. [22]

    Lavender: Unifying video-language understanding as masked language modeling

    Linjie Li, Zhe Gan, Kevin Lin, Chung-Ching Lin, Zicheng Liu, Ce Liu, and Lijuan Wang. Lavender: Unifying video-language understanding as masked language modeling. arXiv preprint arXiv:2206.07160, 2022

  23. [23]

    Learning spatiotemporal features via video and text pair discrimination

    Tianhao Li and Limin Wang. Learning spatiotemporal features via video and text pair discrimination. CoRR, abs/2001.05691, 2020

  24. [24]

    Taskmatrix.ai: Completing tasks by connecting foundation models with millions of apis

    Yaobo Liang, Chenfei Wu, Ting Song, Wenshan Wu, Yan Xia, Yu Liu, Yang Ou, Shuai Lu, Lei Ji, Shaoguang Mao, et al. Taskmatrix.ai: Completing tasks by connecting foundation models with millions of apis. arXiv preprint arXiv:2303.16434, 2023

  25. [25]

    Visual instruction tuning

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. 2023

  26. [26]

    Internchat: Solving vision-centric tasks by interacting with chatbots beyond language

    Zhaoyang Liu, Yinan He, Wenhai Wang, Weiyun Wang, Yi Wang, Shoufa Chen, Qinglong Zhang, Yang Yang, Qingyun Li, Jiashuo Yu, Kunchang Li, Zhe Chen, Xue Yang, Xizhou Zhu, Yali Wang, Limin Wang, Ping Luo, Jifeng Dai, and Yu Qiao. Internchat: Solving vision-centric tasks by interacting with chatbots beyond language. https://arxiv.org/abs/2305.05662, 2023

  27. [27]

    End-to-end learning of visual representations from uncurated instructional videos

    Antoine Miech, Jean-Baptiste Alayrac, Lucas Smaira, Ivan Laptev, Josef Sivic, and Andrew Zisserman. End-to-end learning of visual representations from uncurated instructional videos. In CVPR, 2020

  28. [28]

    Cross-task generalization via natural language crowdsourcing instructions

    Swaroop Mishra, Daniel Khashabi, Chitta Baral, and Hannaneh Hajishirzi. Cross-task generalization via natural language crowdsourcing instructions. arXiv preprint arXiv:2104.08773, 2021

  29. [29]

    Gpt-4 technical report

    OpenAI. Gpt-4 technical report. arXiv, 2023

  30. [30]

    Chatgpt: Optimizing language models for dialogue

    TB OpenAI. Chatgpt: Optimizing language models for dialogue. OpenAI, 2022

  31. [31]

    Im2text: Describing images using 1 million captioned photographs

    Vicente Ordonez, Girish Kulkarni, and Tamara Berg. Im2text: Describing images using 1 million captioned photographs. Advances in neural information processing systems , 24, 2011

  32. [32]

    Training language models to follow instructions with human feedback

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems , 35:27730–27744, 2022

  33. [33]

    Robust Speech Recognition via Large-Scale Weak Supervision

    Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. Robust speech recognition via large-scale weak supervision. arXiv preprint arXiv:2212.04356, 2022

  34. [34]

    Exploring the limits of transfer learning with a unified text-to-text transformer

    Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research , 21(1):5485–5551, 2020

  35. [35]

    Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning

    Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Proceedings of ACL, 2018

  36. [36]

    How much can clip benefit vision-and-language tasks?

    Sheng Shen, Liunian Harold Li, Hao Tan, Mohit Bansal, Anna Rohrbach, Kai-Wei Chang, Zhewei Yao, and Kurt Keutzer. How much can clip benefit vision-and-language tasks? arXiv preprint arXiv:2107.06383, 2021

  37. [37]

    HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face

    Yongliang Shen, Kaitao Song, Xu Tan, Dongsheng Li, Weiming Lu, and Yueting Zhuang. Hugginggpt: Solving ai tasks with chatgpt and its friends in huggingface. arXiv preprint arXiv:2303.17580, 2023

  38. [38]

    Videobert: A joint model for video and language representation learning

    Chen Sun, Austin Myers, Carl Vondrick, Kevin P. Murphy, and Cordelia Schmid. Videobert: A joint model for video and language representation learning. ICCV, 2019

  39. [39]

    EVA-CLIP: Improved Training Techniques for CLIP at Scale

    Quan Sun, Yuxin Fang, Ledell Wu, Xinlong Wang, and Yue Cao. Eva-clip: Improved training techniques for clip at scale. arXiv preprint arXiv:2303.15389, 2023

  40. [40]

    Stanford alpaca: An instruction-following llama model

    Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. Stanford alpaca: An instruction-following llama model. https://github.com/tatsu-lab/stanford_alpaca, 2023

  41. [41]

    Videomae: Masked autoencoders are data-efficient learners for self-supervised video pre-training

    Zhan Tong, Yibing Song, Jue Wang, and Limin Wang. Videomae: Masked autoencoders are data-efficient learners for self-supervised video pre-training. In Advances in Neural Information Processing Systems , 2022

  42. [42]

    LLaMA: Open and Efficient Foundation Language Models

    Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023

  43. [43]

    Alex Jinpeng Wang, Yixiao Ge, Rui Yan, Yuying Ge, Xudong Lin, Guanyu Cai, Jianping Wu, Ying Shan, Xiaohu Qie, and Mike Zheng Shou. All in one: Exploring unified video-language pre-training. arXiv preprint arXiv:2203.07303, 2022

  44. [44]

    Limin Wang, Bingkun Huang, Zhiyu Zhao, Zhan Tong, Yinan He, Yi Wang, Yali Wang, and Yu Qiao. Videomae v2: Scaling video masked autoencoders with dual masking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023

  45. [45]

    Wenhai Wang, Jifeng Dai, Zhe Chen, Zhenhang Huang, Zhiqi Li, Xizhou Zhu, Xiaowei Hu, Tong Lu, Lewei Lu, Hongsheng Li, et al. Internimage: Exploring large-scale vision foundation models with deformable convolutions. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023

  46. [46]

    Yi Wang, Kunchang Li, Yizhuo Li, Yinan He, Bingkun Huang, Zhiyu Zhao, Hongjie Zhang, Jilan Xu, Yi Liu, Zun Wang, Sen Xing, Guo Chen, Junting Pan, Jiashuo Yu, Yali Wang, Limin Wang, and Yu Qiao. Internvideo: General video foundation models via generative and discriminative learning. arXiv preprint arXiv:2212.03191, 2022

  47. [47]

    Chenfei Wu, Shengming Yin, Weizhen Qi, Xiaodong Wang, Zecheng Tang, and Nan Duan. Visual chatgpt: Talking, drawing and editing with visual foundation models. arXiv preprint arXiv:2303.04671, 2023

  48. [48]

    Jialian Wu, Jianfeng Wang, Zhengyuan Yang, Zhe Gan, Zicheng Liu, Junsong Yuan, and Lijuan Wang. Grit: A generative region-to-text transformer for object understanding. arXiv preprint arXiv:2212.00280, 2022

  49. [49]

    Hu Xu, Gargi Ghosh, Po-Yao Huang, Dmytro Okhonko, Armen Aghajanyan, Florian Metze, Luke Zettlemoyer, and Christoph Feichtenhofer. Videoclip: Contrastive pre-training for zero-shot video-text understanding. arXiv preprint arXiv:2109.14084, 2021

  50. [50]

    Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Ehsan Azarnasab, Faisal Ahmed, Zicheng Liu, Ce Liu, Michael Zeng, and Lijuan Wang. Mm-react: Prompting chatgpt for multimodal reasoning and action. ArXiv, abs/2303.11381, 2023

  51. [51]

    Lewei Yao, Runhui Huang, Lu Hou, Guansong Lu, Minzhe Niu, Hang Xu, Xiaodan Liang, Zhenguo Li, Xin Jiang, and Chunjing Xu. Filip: Fine-grained interactive language-image pre-training. arXiv preprint arXiv:2111.07783, 2021

  52. [52]

    Qinghao Ye, Haiyang Xu, Guohai Xu, Jiabo Ye, Ming Yan, Yiyang Zhou, Junyang Wang, Anwen Hu, Pengcheng Shi, Yaya Shi, Chaoya Jiang, Chenliang Li, Yuanhong Xu, Hehong Chen, Junfeng Tian, Qian Qi, Ji Zhang, and Fei Huang. mplug-owl: Modularization empowers large language models with multimodality, 2023

  53. [53]

    Lu Yuan, Dongdong Chen, Yi-Ling Chen, Noel Codella, Xiyang Dai, Jianfeng Gao, Houdong Hu, Xuedong Huang, Boxin Li, Chunyuan Li, et al. Florence: A new foundation model for computer vision. arXiv preprint arXiv:2111.11432, 2021

  54. [54]

    Rowan Zellers, Jiasen Lu, Ximing Lu, Youngjae Yu, Yanpeng Zhao, Mohammadreza Salehi, Aditya Kusupati, Jack Hessel, Ali Farhadi, and Yejin Choi. Merlot reserve: Neural script knowledge through vision and language and sound. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16375–16387, 2022

  55. [55]

    Rowan Zellers, Ximing Lu, Jack Hessel, Youngjae Yu, Jae Sung Park, Jize Cao, Ali Farhadi, and Yejin Choi. Merlot: Multimodal neural script knowledge models. NeurIPS, 2021

  56. [56]

    Aohan Zeng, Xiao Liu, Zhengxiao Du, Zihan Wang, Hanyu Lai, Ming Ding, Zhuoyi Yang, Yifan Xu, Wendi Zheng, Xiao Xia, et al. Glm-130b: An open bilingual pre-trained model. arXiv preprint arXiv:2210.02414, 2022

  57. [57]

    Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, et al. Opt: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068, 2022

  58. [59]

    Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592, 2023

  59. [60]

    Linchao Zhu and Yi Yang. Actbert: Learning global-local video-text representations. CVPR, 2020

A Appendix

Instruction for a brief description. Following the brief image instruction in LLaVA, we generate the video instruction with the aid of ChatGPT as shown in Table 8. In Stage 1, we randomly sample the instruction to generate brief descriptions of im...
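The sampling step described above can be sketched minimally. This is an illustrative reconstruction only: the instruction strings and the `sample_instruction` helper below are hypothetical stand-ins, since the paper's actual ChatGPT-generated templates live in its Table 8.

```python
import random

# Hypothetical pool of brief-description instruction templates
# (placeholders for the paper's ChatGPT-generated list).
BRIEF_INSTRUCTIONS = [
    "Describe the following video concisely.",
    "Provide a brief description of the given video clip.",
    "Summarize the visual content of the video.",
]

def sample_instruction(rng: random.Random) -> str:
    """Randomly pick one instruction template for a training sample."""
    return rng.choice(BRIEF_INSTRUCTIONS)

rng = random.Random(0)  # fixed seed for reproducibility in this sketch
print(sample_instruction(rng))
```

Sampling a fresh template per video, rather than reusing one fixed prompt, is a common way to keep the tuned model from overfitting to a single instruction phrasing.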