pith. machine review for the scientific record.

arxiv: 2305.06355 · v2 · submitted 2023-05-10 · 💻 cs.CV · cs.CL


VideoChat: Chat-Centric Video Understanding

Kunchang Li, Limin Wang, Ping Luo, Wenhai Wang, Yali Wang, Yinan He, Yi Wang, Yizhuo Li, Yu Qiao

Pith reviewed 2026-05-13 23:25 UTC · model grok-4.3

classification 💻 cs.CV cs.CL
keywords: video understanding · large language models · multimodal models · instruction tuning · spatiotemporal reasoning · chat-centric AI

The pith

VideoChat links video models to language models with a trainable interface for conversational video analysis.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents VideoChat as an end-to-end system that joins video foundation models with large language models through a learnable neural interface. This design targets chat-style interactions with video, particularly for reasoning about time and space, locating events, and tracing causal links. The authors also release a dataset of thousands of videos paired with detailed descriptions and dialogues to train the system on these skills. If the interface works as intended, the result is a prototype that lets users discuss video content in natural language rather than through separate analysis tools.

Core claim

It integrates video foundation models and large language models via a learnable neural interface, excelling in spatiotemporal reasoning, event localization, and causal relationship inference.

What carries the argument

The learnable neural interface that aligns outputs from the video foundation model with inputs to the large language model for joint training on video instructions.
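The excerpt does not specify the interface internals, so as a minimal sketch: one common design is a trainable projection that maps frozen video-encoder features into the language model's embedding space. The dimensions, class name, and linear form below are illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions, chosen for illustration only.
VIDEO_DIM = 768   # output width of a frozen video foundation model
LLM_DIM = 4096    # input embedding width of a frozen large language model

class LinearInterface:
    """Trainable projection aligning frozen video features with the LLM's
    token-embedding space. During instruction tuning, only W and b would be
    updated; both backbone models stay frozen."""

    def __init__(self, in_dim, out_dim):
        self.W = rng.normal(0, 0.02, size=(in_dim, out_dim))
        self.b = np.zeros(out_dim)

    def __call__(self, video_feats):
        # video_feats: (num_video_tokens, in_dim) -> (num_video_tokens, out_dim)
        return video_feats @ self.W + self.b

interface = LinearInterface(VIDEO_DIM, LLM_DIM)
video_tokens = rng.normal(size=(16, VIDEO_DIM))  # e.g., 16 pooled clip features
llm_ready = interface(video_tokens)
print(llm_ready.shape)  # projected tokens, ready to prepend to text embeddings
```

In practice such interfaces are often heavier than a single linear layer (e.g., cross-attention query modules), but the alignment role is the same: translate visual tokens into inputs the language model can attend over.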

If this is right

  • Natural-language questions about what happens at specific times in a video become answerable in one system.
  • The same model can localize events and explain causal chains without separate modules for each task.
  • The released video-instruction dataset can be reused to train other multimodal chat systems.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Scaling the interface to longer or streaming videos would test whether the current training setup generalizes beyond the dataset's clip lengths.
  • Adding audio tracks to the input could strengthen causal inferences that currently rely only on visual cues.
  • The open-source release invites direct comparisons with future models on the same instruction set to measure progress.

Load-bearing premise

Training the neural interface on the video-centric instruction dataset is sufficient to bridge the two models well enough for strong performance on spatiotemporal and causal tasks.

What would settle it

Quantitative tests on held-out videos would settle it: if VideoChat answers questions about event timing, location, and cause no more accurately than a frozen video captioner paired with an off-the-shelf language model, the claimed integration benefit is disproved.
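One way to operationalize that comparison, assuming hypothetical prediction lists from each system on the same held-out question set (exact-match accuracy stands in for whatever benchmark metric a real evaluation would use):

```python
def accuracy(predictions, answers):
    """Fraction of exact-match answers; a stand-in for a real QA metric."""
    assert len(predictions) == len(answers)
    return sum(p == a for p, a in zip(predictions, answers)) / len(answers)

# Hypothetical held-out answers and outputs from the two systems.
gold      = ["opens door", "kitchen", "spilled water"]
videochat = ["opens door", "kitchen", "knocked over glass"]
pipeline  = ["opens door", "hallway", "unknown"]  # captioner + off-the-shelf LLM

integration_benefit = accuracy(videochat, gold) - accuracy(pipeline, gold)
print(f"benefit: {integration_benefit:+.2f}")
```

A benefit at or below zero on a sufficiently large held-out set, across timing, location, and causal questions, would undercut the integration claim.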

read the original abstract

In this paper, we initiate an attempt of developing an end-to-end chat-centric video understanding system, coined as VideoChat. It integrates video foundation models and large language models via a learnable neural interface, excelling in spatiotemporal reasoning, event localization, and causal relationship inference. To instructively tune this system, we build a video-centric instruction dataset, composed of thousands of videos associated with detailed descriptions and conversations. This dataset emphasizes spatiotemporal reasoning and captures causal relationships, providing a valuable asset for training our chat-centric video understanding system. Preliminary qualitative experiments demonstrate the potential of our system across a broad spectrum of video applications, which could serve as a simple prototype system for future research on chat-centric video understanding. Access our code and data at https://github.com/OpenGVLab/Ask-Anything

Editorial analysis

A structured set of objections, weighed in public.

A referee report, simulated authors' rebuttal, circularity audit, and axiom ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces VideoChat, an end-to-end chat-centric video understanding system that integrates video foundation models with large language models via a learnable neural interface. A new video-centric instruction dataset is constructed to emphasize spatiotemporal reasoning and causal relationships, and the system is tuned on this data. The central claim is that the resulting model excels at spatiotemporal reasoning, event localization, and causal inference, with preliminary qualitative experiments presented as evidence of its potential as a prototype for future work.

Significance. If the bridging effectiveness of the learnable interface were quantitatively validated, the work would supply a useful open dataset and modular architecture for multimodal video chat systems. The absence of metrics, ablations, or baselines, however, leaves the performance claims unverified and reduces the immediate contribution to a preliminary demonstration rather than a substantiated advance.

major comments (3)
  1. [Abstract, §4] Abstract and §4 (Experiments): the assertion that the system 'excels' in spatiotemporal reasoning, event localization, and causal relationship inference rests exclusively on preliminary qualitative examples. No quantitative metrics (e.g., accuracy, mAP, or causal QA scores), baselines (e.g., existing video-LLMs), or error analysis are reported, rendering the central performance claim unsupported.
  2. [§3] §3 (Method): the learnable neural interface is presented as the key bridge between video foundation models and the LLM, yet no ablation isolating its contribution (e.g., frozen vs. trained interface) or analysis of how it encodes temporal or causal structure is provided. This leaves the mechanism by which the claimed capabilities arise unexamined.
  3. [§2] §2 (Dataset): while the video-centric instruction dataset is described as capturing causal relationships, no statistics on the distribution of causal vs. descriptive queries, inter-annotator agreement, or verification that the data actually elicits the targeted reasoning are given. This weakens the justification for using the dataset to train the claimed capabilities.
minor comments (2)
  1. [Abstract] The GitHub link is provided but the manuscript does not specify which components (interface weights, dataset splits, evaluation scripts) are released, making reproducibility harder to assess.
  2. [§3] Notation for the neural interface (e.g., input/output dimensions, training objective) is introduced without an accompanying equation or diagram in §3, which would clarify the architecture.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We agree that the current version of the manuscript is preliminary in nature and that the performance claims would benefit from quantitative support, ablations, and dataset statistics. We will revise the paper accordingly to address these points while preserving its positioning as an initial prototype for chat-centric video understanding. Our point-by-point responses follow.

read point-by-point responses
  1. Referee: [Abstract, §4] Abstract and §4 (Experiments): the assertion that the system 'excels' in spatiotemporal reasoning, event localization, and causal relationship inference rests exclusively on preliminary qualitative examples. No quantitative metrics (e.g., accuracy, mAP, or causal QA scores), baselines (e.g., existing video-LLMs), or error analysis are reported, rendering the central performance claim unsupported.

    Authors: We agree that the central claims rest on qualitative examples alone and that this leaves the performance assertions unsupported by quantitative evidence. In the revision we will replace the word 'excels' with more measured language that reflects the preliminary nature of the work. We will also add quantitative results on standard video QA and temporal grounding benchmarks, include direct comparisons to existing video-LLM baselines, and provide a basic error analysis of failure cases. revision: yes

  2. Referee: [§3] §3 (Method): the learnable neural interface is presented as the key bridge between video foundation models and the LLM, yet no ablation isolating its contribution (e.g., frozen vs. trained interface) or analysis of how it encodes temporal or causal structure is provided. This leaves the mechanism by which the claimed capabilities arise unexamined.

    Authors: We concur that the absence of ablations and mechanistic analysis leaves the role of the learnable interface insufficiently examined. In the revised manuscript we will report ablations that compare the trained interface against a frozen counterpart and will include visualizations (e.g., attention maps over video tokens) together with probing experiments to illustrate how temporal and causal information is represented. revision: yes

  3. Referee: [§2] §2 (Dataset): while the video-centric instruction dataset is described as capturing causal relationships, no statistics on the distribution of causal vs. descriptive queries, inter-annotator agreement, or verification that the data actually elicits the targeted reasoning are given. This weakens the justification for using the dataset to train the claimed capabilities.

    Authors: We acknowledge that additional dataset documentation is required. The revised version will include quantitative statistics on the proportion of causal, spatiotemporal, and descriptive queries, inter-annotator agreement figures for the collected conversations, and a description of the annotation protocol and quality-control steps used to ensure the data targets the intended reasoning skills. revision: yes
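The frozen-vs-trained interface ablation promised in response 2 amounts to a switch in the update step: the frozen arm keeps the interface at initialization while everything else is held equal. A minimal sketch, with hypothetical weights and gradients standing in for the real training loop:

```python
import numpy as np

rng = np.random.default_rng(1)

def sgd_step(weights, grads, lr, trainable):
    """One gradient step; the frozen ablation arm skips the update entirely."""
    if not trainable:
        return weights.copy()       # frozen arm: interface stays at init
    return weights - lr * grads

# Stand-ins for interface weights and a batch gradient.
W_init = rng.normal(size=(4, 4))
grads = np.ones((4, 4))

W_trained = sgd_step(W_init, grads, lr=0.1, trainable=True)
W_frozen = sgd_step(W_init, grads, lr=0.1, trainable=False)

# The comparison the referee asks for: run both arms to the same training
# budget, then evaluate each on the same spatiotemporal/causal benchmark split.
print(np.allclose(W_frozen, W_init), np.allclose(W_trained, W_init))
```

Any accuracy gap between the two arms then isolates what the learnable interface itself contributes, which is exactly the evidence the referee says is missing.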

Circularity Check

0 steps flagged

No significant circularity: empirical system assembly with external validation

full rationale

The paper describes an engineering integration of pre-existing video foundation models and LLMs via a new learnable neural interface, trained on a custom video-centric instruction dataset. Claims of performance in spatiotemporal reasoning etc. rest on preliminary qualitative experiments rather than any derivation, equation, or prediction. No self-definitional steps, fitted inputs renamed as predictions, or load-bearing self-citations appear in the provided text. The central result is an assembled prototype whose effectiveness is asserted via external demonstration, not closed-loop reduction to its own inputs.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim depends on the effectiveness of the neural interface trained on the new dataset and on the assumption that prior video foundation models supply adequate features; no new physical entities are introduced.

free parameters (1)
  • learnable neural interface parameters
    Weights of the interface are fitted during training on the instruction dataset to align video features with the language model.
axioms (1)
  • domain assumption: pre-trained video foundation models extract features sufficient for spatiotemporal and causal reasoning when aligned via the interface.
    Invoked when claiming that the integration yields strong performance without re-deriving video feature extraction.

pith-pipeline@v0.9.0 · 5450 in / 1366 out tokens · 55715 ms · 2026-05-13T23:25:41.740675+00:00 · methodology

discussion (0)


Forward citations

Cited by 24 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. EvoGround: Self-Evolving Video Agents for Video Temporal Grounding

    cs.CV 2026-05 unverdicted novelty 7.0

    A proposer-solver agent pair achieves supervised-level video temporal grounding and fine-grained captioning from 2.5K unlabeled videos via self-reinforcing evolution.

  2. OZ-TAL: Online Zero-Shot Temporal Action Localization

    cs.CV 2026-05 unverdicted novelty 7.0

    Defines OZ-TAL task and presents a training-free VLM-based method that outperforms prior approaches for online and offline zero-shot temporal action localization on THUMOS14 and ActivityNet-1.3.

  3. IMPACT-CYCLE: A Contract-Based Multi-Agent System for Claim-Level Supervisory Correction of Long-Video Semantic Memory

    cs.CV 2026-04 unverdicted novelty 7.0

    A contract-based multi-agent system maintains a claim-level semantic memory for long videos, enabling targeted corrections that raise VQA accuracy from 0.71 to 0.79 and cut human arbitration cost by 4.8x on VidOR.

  4. OASIS: On-Demand Hierarchical Event Memory for Streaming Video Reasoning

    cs.CV 2026-04 unverdicted novelty 7.0

    OASIS organizes streaming video into hierarchical events and retrieves memory on-demand via intent-driven refinement to improve long-horizon accuracy and compositional reasoning with bounded token costs.

  5. SVAgent: Storyline-Guided Long Video Understanding via Cross-Modal Multi-Agent Collaboration

    cs.CV 2026-04 unverdicted novelty 7.0

    SVAgent improves long video question answering by constructing storylines via multi-agent collaboration and aligning cross-modal predictions for more robust, human-like reasoning.

  6. LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models

    cs.CV 2024-07 unverdicted novelty 7.0

    LLaVA-NeXT-Interleave unifies multi-image, video, and 3D capabilities in large multimodal models via a new 1.18M-sample interleaved dataset and benchmark, achieving leading results across those tasks while preserving ...

  7. MLVU: Benchmarking Multi-task Long Video Understanding

    cs.CV 2024-06 conditional novelty 7.0

    MLVU is a new benchmark for long video understanding that uses extended videos across diverse genres and multi-task evaluations, revealing that current MLLMs struggle significantly and degrade sharply with longer durations.

  8. SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension

    cs.CL 2023-07 unverdicted novelty 7.0

    SEED-Bench is a new benchmark of 19K multiple-choice questions for evaluating generative comprehension in multimodal LLMs across 12 image and video dimensions.

  9. Beyond Perceptual Shortcuts: Causal-Inspired Debiasing Optimization for Generalizable Video Reasoning in Lightweight MLLMs

    cs.CV 2026-05 unverdicted novelty 6.0

    VideoThinker improves lightweight MLLM video reasoning by creating a bias model to capture shortcuts and applying causal debiasing policy optimization to push away from them, achieving SOTA efficiency with minimal data.

  10. Seeing Fast and Slow: Learning the Flow of Time in Videos

    cs.CV 2026-04 unverdicted novelty 6.0

    Self-supervised models learn to perceive and manipulate the flow of time in videos, supporting speed detection, large-scale slow-motion data curation, and temporally controllable video synthesis.

  11. One Token per Highly Selective Frame: Towards Extreme Compression for Long Video Understanding

    cs.CV 2026-04 unverdicted novelty 6.0

    XComp reaches extreme video compression (one token per selective frame) via learnable progressive token compression and question-conditioned frame selection, lifting LVBench accuracy from 42.9 percent to 46.2 percent ...

  12. ViLL-E: Video LLM Embeddings for Retrieval

    cs.CV 2026-04 unverdicted novelty 6.0

    ViLL-E introduces a dynamic embedding mechanism and joint contrastive-generative training for VideoLLMs, delivering up to 7% gains in temporal localization and 4% in video retrieval while enabling new zero-shot capabilities.

  13. POINTS-Long: Adaptive Dual-Mode Visual Reasoning in MLLMs

    cs.CV 2026-04 unverdicted novelty 6.0

    POINTS-Long is a dual-mode multimodal large language model that uses dynamic visual token scaling to retain 97.7-99.7% accuracy on long-form tasks with 1/40 to 1/10 of the tokens and supports streaming via a detachable KV-cache.

  14. InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

    cs.CV 2025-08 unverdicted novelty 6.0

    InternVL3.5 advances open-source multimodal models with Cascade RL for +16% reasoning gains and ViR for 4x inference speedup, with the 241B model reaching SOTA among open-source MLLMs on multimodal, reasoning, and age...

  15. InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

    cs.CV 2025-04 conditional novelty 6.0

    InternVL3-78B sets a new open-source SOTA of 72.2 on MMMU via native joint multimodal pre-training, V2PE, MPO, and test-time scaling while remaining competitive with proprietary models.

  16. Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

    cs.CV 2024-12 unverdicted novelty 6.0

    InternVL 2.5 is the first open-source MLLM to surpass 70% on the MMMU benchmark via model, data, and test-time scaling, with a 3.7-point gain from chain-of-thought reasoning.

  17. LLaVA-Video: Video Instruction Tuning With Synthetic Data

    cs.CV 2024-10 unverdicted novelty 6.0

    LLaVA-Video-178K is a new synthetic video instruction dataset that, when combined with existing data to train LLaVA-Video, produces strong results on video understanding benchmarks.

  18. Video-LLaVA: Learning United Visual Representation by Alignment Before Projection

    cs.CV 2023-11 unverdicted novelty 6.0

    Video-LLaVA creates a unified visual representation for images and videos via pre-projection alignment, enabling mutual enhancement from joint training and strong results on image and video benchmarks.

  19. From Where Things Are to What They Are For: Benchmarking Spatial-Functional Intelligence in Multimodal LLMs

    cs.CV 2026-05 unverdicted novelty 5.0

    SFI-Bench shows current multimodal LLMs struggle to integrate spatial memory with functional reasoning and external knowledge in video tasks.

  20. InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks

    cs.CV 2023-12 unverdicted novelty 5.0

    InternVL scales a vision model to 6B parameters and aligns it with LLMs using web data to achieve state-of-the-art results on 32 visual-linguistic benchmarks.

  21. CurEvo: Curriculum-Guided Self-Evolution for Video Understanding

    cs.CV 2026-04 unverdicted novelty 4.0

    CurEvo integrates curriculum guidance into self-evolution to structure autonomous improvement of video understanding models, yielding gains on VideoQA benchmarks.

  22. Empowering Video Translation using Multimodal Large Language Models

    cs.CV 2026-04 unverdicted novelty 4.0

    The paper offers the first focused review of MLLM-based video translation organized by a three-role taxonomy of Semantic Reasoner, Expressive Performer, and Visual Synthesizer, plus open challenges.

  23. VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding

    cs.CV 2025-01 unverdicted novelty 4.0

    VideoLLaMA3 uses a vision-centric training paradigm and token-reduction design to reach competitive results on image and video benchmarks.

  24. How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites

    cs.CV 2024-04 unverdicted novelty 4.0

    InternVL 1.5 narrows the performance gap to proprietary multimodal models via a stronger transferable vision encoder, dynamic high-resolution tiling, and curated English-Chinese training data.

Reference graph

Works this paper leans on

59 extracted references · 59 canonical work pages · cited by 24 Pith papers · 12 internal anchors

  1. [1]

    Openflamingo, March 2023

    Anas Awadalla, Irena Gao, Joshua Gardner, Jack Hessel, Yusuf Hanafy, Wanrong Zhu, Kalyani Marathe, Yonatan Bitton, Samir Gadre, Jenia Jitsev, Simon Kornblith, Pang Wei Koh, Gabriel Ilharco, Mitchell Wortsman, and Ludwig Schmidt. Openflamingo, March 2023

  2. [2]

    Frozen in time: A joint video and image encoder for end-to-end retrieval

    Max Bain, Arsha Nagrani, Gül Varol, and Andrew Zisserman. Frozen in time: A joint video and image encoder for end-to-end retrieval. In IEEE International Conference on Computer Vision, 2021

  3. [3]

    Language models are few-shot learners

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in Neural Information Processing Systems , 33:1877–1901, 2020

  4. [4]

    Conceptual 12m: Pushing web-scale image-text pre-training to recognize long-tail visual concepts

    Soravit Changpinyo, Piyush Sharma, Nan Ding, and Radu Soricut. Conceptual 12m: Pushing web-scale image-text pre-training to recognize long-tail visual concepts. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3558–3568, 2021

  5. [5]

    Internvideo-ego4d: A pack of champion solutions to ego4d challenges

    Guo Chen, Sen Xing, Zhe Chen, Yi Wang, Kunchang Li, Yizhuo Li, Yi Liu, Jiahao Wang, Yin-Dong Zheng, Bingkun Huang, et al. Internvideo-ego4d: A pack of champion solutions to ego4d challenges. arXiv preprint arXiv:2211.09529, 2022

  6. [6]

    Microsoft COCO Captions: Data Collection and Evaluation Server

    Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325, 2015

  7. [7]

    Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality

    Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E Gonzalez, et al. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality, 2023

  8. [8]

    Scaling Instruction-Finetuned Language Models

    Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Eric Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, et al. Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416, 2022

  9. [9]

    Moss

    MOSS contributors. Moss. https://github.com/OpenLMLab/MOSS, 2023

  10. [10]

    Stablelm: Stability ai language models

    StableLM contributors. Stablelm: Stability ai language models. https://github.com/stability-AI/stableLM, 2023

  11. [11]

    An empirical study of training end-to-end vision-and- language transformers

    Zi-Yi Dou, Yichong Xu, Zhe Gan, Jianfeng Wang, Shuohang Wang, Lijuan Wang, Chenguang Zhu, Pengchuan Zhang, Lu Yuan, Nanyun Peng, et al. An empirical study of training end-to-end vision-and- language transformers. In CVPR, 2022

  12. [12]

    Violet: End- to-end video-language transformers with masked visual-token modeling

    Tsu-Jui Fu, Linjie Li, Zhe Gan, Kevin Lin, William Yang Wang, Lijuan Wang, and Zicheng Liu. Violet: End- to-end video-language transformers with masked visual-token modeling. arXiv preprint arXiv:2111.12681, 2021

  13. [13]

    Scaling up vision-language pre-training for image captioning

    Xiaowei Hu, Zhe Gan, Jianfeng Wang, Zhengyuan Yang, Zicheng Liu, Yumao Lu, and Lijuan Wang. Scaling up vision-language pre-training for image captioning. In CVPR, 2022

  14. [14]

    Language is not all you need: Aligning perception with language models

    Shaohan Huang, Li Dong, Wenhui Wang, Yaru Hao, Saksham Singhal, Shuming Ma, Tengchao Lv, Lei Cui, Owais Khan Mohammed, Qiang Liu, et al. Language is not all you need: Aligning perception with language models. arXiv preprint arXiv:2302.14045, 2023

  15. [15]

    Tag2text: Guiding vision-language model via image tagging

    Xinyu Huang, Youcai Zhang, Jinyu Ma, Weiwei Tian, Rui Feng, Yuejie Zhang, Yaqian Li, Yandong Guo, and Lei Zhang. Tag2text: Guiding vision-language model via image tagging. arXiv preprint arXiv:2303.05657, 2023

  16. [16]

    Dolphin: General video interaction platform based on llms, 2023

    Zehuan Huang, Haoran Feng, Chongzhi Zhang, Lu Sheng, Ziwei Liu, and Jing Shao. Dolphin: General video interaction platform based on llms, 2023. https://github.com/kaleido-lab/dolphin

  17. [17]

    Visual genome: Connecting language and vision using crowdsourced dense image annotations

    Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A Shamma, et al. Visual genome: Connecting language and vision using crowdsourced dense image annotations. International journal of computer vision , 123:32–73, 2017

  18. [18]

    BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models

    Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597, 2023

  19. [19]

    Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation

    Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In International Conference on Machine Learning, pages 12888–12900. PMLR, 2022

  20. [20]

    Uniformerv2: Spatiotemporal learning by arming image vits with video uniformer

    Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Limin Wang, and Yu Qiao. Uniformerv2: Spatiotemporal learning by arming image vits with video uniformer. arXiv preprint arXiv:2211.09552, 2022

  21. [21]

    Unmasked teacher: Towards training-efficient video foundation models

    Kunchang Li, Yali Wang, Yizhuo Li, Yi Wang, Yinan He, Limin Wang, and Yu Qiao. Unmasked teacher: Towards training-efficient video foundation models. arXiv preprint arXiv:2303.16058, 2023

  22. [22]

    Lavender: Unifying video-language understanding as masked language modeling

    Linjie Li, Zhe Gan, Kevin Lin, Chung-Ching Lin, Zicheng Liu, Ce Liu, and Lijuan Wang. Lavender: Unifying video-language understanding as masked language modeling. arXiv preprint arXiv:2206.07160, 2022

  23. [23]

    Learning spatiotemporal features via video and text pair discrimination

    Tianhao Li and Limin Wang. Learning spatiotemporal features via video and text pair discrimination. CoRR, abs/2001.05691, 2020

  24. [24]

    Taskmatrix.ai: Completing tasks by connecting foundation models with millions of apis

    Yaobo Liang, Chenfei Wu, Ting Song, Wenshan Wu, Yan Xia, Yu Liu, Yang Ou, Shuai Lu, Lei Ji, Shaoguang Mao, et al. Taskmatrix.ai: Completing tasks by connecting foundation models with millions of apis. arXiv preprint arXiv:2303.16434, 2023

  25. [25]

    Visual instruction tuning

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. 2023

  26. [26]

    Internchat: Solving vision-centric tasks by interacting with chatbots beyond language

    Zhaoyang Liu, Yinan He, Wenhai Wang, Weiyun Wang, Yi Wang, Shoufa Chen, Qinglong Zhang, Yang Yang, Qingyun Li, Jiashuo Yu, Kunchang Li, Zhe Chen, Xue Yang, Xizhou Zhu, Yali Wang, Limin Wang, Ping Luo, Jifeng Dai, and Yu Qiao. Internchat: Solving vision-centric tasks by interacting with chatbots beyond language. https://arxiv.org/abs/2305.05662, 2023

  27. [27]

    End-to-end learning of visual representations from uncurated instructional videos

    Antoine Miech, Jean-Baptiste Alayrac, Lucas Smaira, Ivan Laptev, Josef Sivic, and Andrew Zisserman. End-to-end learning of visual representations from uncurated instructional videos. In CVPR, 2020

  28. [28]

    Cross-task generalization via natural language crowdsourcing instructions

    Swaroop Mishra, Daniel Khashabi, Chitta Baral, and Hannaneh Hajishirzi. Cross-task generalization via natural language crowdsourcing instructions. arXiv preprint arXiv:2104.08773, 2021

  29. [29]

    Gpt-4 technical report

    OpenAI. Gpt-4 technical report. arXiv, 2023

  30. [30]

    Chatgpt: Optimizing language models for dialogue

    TB OpenAI. Chatgpt: Optimizing language models for dialogue. OpenAI, 2022

  31. [31]

    Im2text: Describing images using 1 million captioned photographs

    Vicente Ordonez, Girish Kulkarni, and Tamara Berg. Im2text: Describing images using 1 million captioned photographs. Advances in neural information processing systems , 24, 2011

  32. [32]

    Training language models to follow instructions with human feedback

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems , 35:27730–27744, 2022

  33. [33]

    Robust Speech Recognition via Large-Scale Weak Supervision

    Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. Robust speech recognition via large-scale weak supervision. arXiv preprint arXiv:2212.04356, 2022

  34. [34]

    Exploring the limits of transfer learning with a unified text-to-text transformer

    Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research , 21(1):5485–5551, 2020

  35. [35]

    Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning

    Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Proceedings of ACL, 2018

  36. [36]

    How much can clip benefit vision-and-language tasks?

    Sheng Shen, Liunian Harold Li, Hao Tan, Mohit Bansal, Anna Rohrbach, Kai-Wei Chang, Zhewei Yao, and Kurt Keutzer. How much can clip benefit vision-and-language tasks? arXiv preprint arXiv:2107.06383, 2021

  37. [37]

    HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face

    Yongliang Shen, Kaitao Song, Xu Tan, Dongsheng Li, Weiming Lu, and Yueting Zhuang. Hugginggpt: Solving ai tasks with chatgpt and its friends in huggingface. arXiv preprint arXiv:2303.17580, 2023

  38. [38]

    Videobert: A joint model for video and language representation learning

    Chen Sun, Austin Myers, Carl Vondrick, Kevin P. Murphy, and Cordelia Schmid. Videobert: A joint model for video and language representation learning. ICCV, 2019

  39. [39]

    EVA-CLIP: Improved Training Techniques for CLIP at Scale

    Quan Sun, Yuxin Fang, Ledell Wu, Xinlong Wang, and Yue Cao. Eva-clip: Improved training techniques for clip at scale. arXiv preprint arXiv:2303.15389, 2023

  40. [40]

    Stanford alpaca: An instruction-following llama model

    Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. Stanford alpaca: An instruction-following llama model. https://github.com/tatsu-lab/stanford_alpaca, 2023

  41. [41]

    Videomae: Masked autoencoders are data-efficient learners for self-supervised video pre-training

    Zhan Tong, Yibing Song, Jue Wang, and Limin Wang. Videomae: Masked autoencoders are data-efficient learners for self-supervised video pre-training. In Advances in Neural Information Processing Systems , 2022

  42. [42]

    LLaMA: Open and Efficient Foundation Language Models

    Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023

  43. [43]

    Alex Jinpeng Wang, Yixiao Ge, Rui Yan, Yuying Ge, Xudong Lin, Guanyu Cai, Jianping Wu, Ying Shan, Xiaohu Qie, and Mike Zheng Shou. All in one: Exploring unified video-language pre-training. arXiv preprint arXiv:2203.07303, 2022

  44. [44]

    Limin Wang, Bingkun Huang, Zhiyu Zhao, Zhan Tong, Yinan He, Yi Wang, Yali Wang, and Yu Qiao. Videomae v2: Scaling video masked autoencoders with dual masking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023

  45. [45]

    Wenhai Wang, Jifeng Dai, Zhe Chen, Zhenhang Huang, Zhiqi Li, Xizhou Zhu, Xiaowei Hu, Tong Lu, Lewei Lu, Hongsheng Li, et al. Internimage: Exploring large-scale vision foundation models with deformable convolutions. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023

  46. [46]

    Yi Wang, Kunchang Li, Yizhuo Li, Yinan He, Bingkun Huang, Zhiyu Zhao, Hongjie Zhang, Jilan Xu, Yi Liu, Zun Wang, Sen Xing, Guo Chen, Junting Pan, Jiashuo Yu, Yali Wang, Limin Wang, and Yu Qiao. Internvideo: General video foundation models via generative and discriminative learning. arXiv preprint arXiv:2212.03191, 2022

  47. [47]

    Chenfei Wu, Shengming Yin, Weizhen Qi, Xiaodong Wang, Zecheng Tang, and Nan Duan. Visual chatgpt: Talking, drawing and editing with visual foundation models. arXiv preprint arXiv:2303.04671, 2023

  48. [48]

    Jialian Wu, Jianfeng Wang, Zhengyuan Yang, Zhe Gan, Zicheng Liu, Junsong Yuan, and Lijuan Wang. Grit: A generative region-to-text transformer for object understanding. arXiv preprint arXiv:2212.00280, 2022

  49. [49]

    Hu Xu, Gargi Ghosh, Po-Yao Huang, Dmytro Okhonko, Armen Aghajanyan, Florian Metze, Luke Zettlemoyer, and Christoph Feichtenhofer. Videoclip: Contrastive pre-training for zero-shot video-text understanding. arXiv preprint arXiv:2109.14084, 2021

  50. [50]

    Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Ehsan Azarnasab, Faisal Ahmed, Zicheng Liu, Ce Liu, Michael Zeng, and Lijuan Wang. Mm-react: Prompting chatgpt for multimodal reasoning and action. ArXiv, abs/2303.11381, 2023

  51. [51]

    Lewei Yao, Runhui Huang, Lu Hou, Guansong Lu, Minzhe Niu, Hang Xu, Xiaodan Liang, Zhenguo Li, Xin Jiang, and Chunjing Xu. Filip: Fine-grained interactive language-image pre-training. arXiv preprint arXiv:2111.07783, 2021

  52. [52]

    Qinghao Ye, Haiyang Xu, Guohai Xu, Jiabo Ye, Ming Yan, Yiyang Zhou, Junyang Wang, Anwen Hu, Pengcheng Shi, Yaya Shi, Chaoya Jiang, Chenliang Li, Yuanhong Xu, Hehong Chen, Junfeng Tian, Qian Qi, Ji Zhang, and Fei Huang. mplug-owl: Modularization empowers large language models with multimodality, 2023

  53. [53]

    Lu Yuan, Dongdong Chen, Yi-Ling Chen, Noel Codella, Xiyang Dai, Jianfeng Gao, Houdong Hu, Xuedong Huang, Boxin Li, Chunyuan Li, et al. Florence: A new foundation model for computer vision. arXiv preprint arXiv:2111.11432, 2021

  54. [54]

    Rowan Zellers, Jiasen Lu, Ximing Lu, Youngjae Yu, Yanpeng Zhao, Mohammadreza Salehi, Aditya Kusupati, Jack Hessel, Ali Farhadi, and Yejin Choi. Merlot reserve: Neural script knowledge through vision and language and sound. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16375–16387, 2022

  55. [55]

    Rowan Zellers, Ximing Lu, Jack Hessel, Youngjae Yu, Jae Sung Park, Jize Cao, Ali Farhadi, and Yejin Choi. Merlot: Multimodal neural script knowledge models. NeurIPS, 2021

  56. [56]

    Aohan Zeng, Xiao Liu, Zhengxiao Du, Zihan Wang, Hanyu Lai, Ming Ding, Zhuoyi Yang, Yifan Xu, Wendi Zheng, Xiao Xia, et al. Glm-130b: An open bilingual pre-trained model. arXiv preprint arXiv:2210.02414, 2022

  57. [57]

    Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, et al. Opt: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068, 2022

  58. [59]

    Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592, 2023

  59. [60]

    Linchao Zhu and Yi Yang. Actbert: Learning global-local video-text representations. CVPR, 2020

A Appendix

Instruction for a brief description. Following the brief image instruction in LLaVA, we generate the video instruction with the aid of ChatGPT as shown in Table 8. In Stage 1, we randomly sample the instruction to generate brief descriptions of im...
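The sampling step described above can be sketched minimally. This is an illustrative reconstruction only: the instruction strings and the `sample_instruction` helper below are hypothetical stand-ins, since the paper's actual ChatGPT-generated templates live in its Table 8.

```python
import random

# Hypothetical pool of brief-description instruction templates
# (placeholders for the paper's ChatGPT-generated list).
BRIEF_INSTRUCTIONS = [
    "Describe the following video concisely.",
    "Provide a brief description of the given video clip.",
    "Summarize the visual content of the video.",
]

def sample_instruction(rng: random.Random) -> str:
    """Randomly pick one instruction template for a training sample."""
    return rng.choice(BRIEF_INSTRUCTIONS)

rng = random.Random(0)  # fixed seed for reproducibility in this sketch
print(sample_instruction(rng))
```

Sampling a fresh template per video, rather than reusing one fixed prompt, is a common way to keep the tuned model from overfitting to a single instruction phrasing.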