VideoChat: Chat-Centric Video Understanding
Pith reviewed 2026-05-13 23:25 UTC · model grok-4.3
The pith
VideoChat links video models to language models with a trainable interface for conversational video analysis.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
It integrates video foundation models and large language models via a learnable neural interface, excelling in spatiotemporal reasoning, event localization, and causal relationship inference.
What carries the argument
The learnable neural interface that aligns outputs from the video foundation model with inputs to the large language model for joint training on video instructions.
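The reviewed text does not describe the interface's internals. As a rough sketch of the general pattern, a small trainable module (here a query-based cross-attention plus a linear projection, in the spirit of Q-Former-style bridges) maps frozen video-encoder features into the LLM's token-embedding space; every name, dimension, and design choice below is an assumption, not the authors' implementation.

```python
# Minimal sketch of a learnable video-to-LLM interface (illustrative only;
# dimensions, module choice, and names are assumptions, not the paper's design).
import torch
import torch.nn as nn

class VideoLanguageInterface(nn.Module):
    """Maps frozen video-encoder features to pseudo-tokens in the LLM embedding space."""

    def __init__(self, video_dim=1024, llm_dim=4096, num_queries=32, num_heads=8):
        super().__init__()
        # Learnable query vectors that pool variable-length video features.
        self.queries = nn.Parameter(torch.randn(num_queries, video_dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(video_dim, num_heads, batch_first=True)
        # Project pooled features into the LLM's input embedding dimension.
        self.proj = nn.Linear(video_dim, llm_dim)

    def forward(self, video_feats):
        # video_feats: (batch, frames_x_patches, video_dim) from a frozen video encoder.
        batch = video_feats.size(0)
        q = self.queries.unsqueeze(0).expand(batch, -1, -1)
        pooled, _ = self.cross_attn(q, video_feats, video_feats)
        # Returns (batch, num_queries, llm_dim): soft "video tokens" prepended to the text prompt.
        return self.proj(pooled)

# During instruction tuning only this module (and optionally the LLM) receives gradients;
# the video encoder stays frozen.
interface = VideoLanguageInterface()
video_tokens = interface(torch.randn(2, 196, 1024))
print(video_tokens.shape)  # torch.Size([2, 32, 4096])
```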
If this is right
- Natural-language questions about what happens at specific times in a video become answerable in one system.
- The same model can localize events and explain causal chains without separate modules for each task.
- The released video-instruction dataset can be reused to train other multimodal chat systems.
Where Pith is reading between the lines
- Scaling the interface to longer or streaming videos would test whether the current training setup generalizes beyond the dataset's clip lengths.
- Adding audio tracks to the input could strengthen causal inferences that currently rely only on visual cues.
- The open-source release invites direct comparisons with future models on the same instruction set to measure progress.
Load-bearing premise
Training the neural interface on the video-centric instruction dataset is sufficient to bridge the two models well enough for strong performance on spatiotemporal and causal tasks.
What would settle it
Quantitative tests on held-out videos: if VideoChat answers questions about event timing, location, and cause no more accurately than a frozen video captioner paired with an off-the-shelf language model, the claimed integration benefit is disproved.
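A hedged sketch of that test, assuming a held-out set of question-answer pairs and two hypothetical wrapper functions, one calling VideoChat and one calling a frozen captioner followed by an off-the-shelf language model; none of the names below come from the paper.

```python
# Sketch of the disconfirming test (illustrative only; the answering functions
# for VideoChat and the captioner+LLM baseline are hypothetical placeholders).
from typing import Callable, Dict, List

def accuracy(answer_fn: Callable[[str, str], str],
             qa_pairs: List[Dict[str, str]],
             judge: Callable[[str, str], bool]) -> float:
    """Fraction of held-out questions (timing, location, cause) answered correctly."""
    correct = sum(judge(answer_fn(ex["video"], ex["question"]), ex["reference"])
                  for ex in qa_pairs)
    return correct / max(len(qa_pairs), 1)

def exact_match(prediction: str, reference: str) -> bool:
    # A lenient or LLM-based judge could replace this strict string match.
    return prediction.strip().lower() == reference.strip().lower()

# If accuracy(videochat_answer, held_out, exact_match) does not exceed
# accuracy(caption_then_llm_answer, held_out, exact_match), the claimed
# benefit of the jointly trained interface is not supported.
```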
Original abstract
In this paper, we initiate an attempt of developing an end-to-end chat-centric video understanding system, coined as VideoChat. It integrates video foundation models and large language models via a learnable neural interface, excelling in spatiotemporal reasoning, event localization, and causal relationship inference. To instructively tune this system, we build a video-centric instruction dataset, composed of thousands of videos associated with detailed descriptions and conversations. This dataset emphasizes spatiotemporal reasoning and captures causal relationships, providing a valuable asset for training our chat-centric video understanding system. Preliminary qualitative experiments demonstrate the potential of our system across a broad spectrum of video applications, which could serve as a simple prototype system for future research on chat-centric video understanding. Access our code and data at https://github.com/OpenGVLab/Ask-Anything
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces VideoChat, an end-to-end chat-centric video understanding system that integrates video foundation models with large language models via a learnable neural interface. A new video-centric instruction dataset is constructed to emphasize spatiotemporal reasoning and causal relationships, and the system is tuned on this data. The central claim is that the resulting model excels at spatiotemporal reasoning, event localization, and causal inference, with preliminary qualitative experiments presented as evidence of its potential as a prototype for future work.
Significance. If the bridging effectiveness of the learnable interface were quantitatively validated, the work would supply a useful open dataset and modular architecture for multimodal video chat systems. The absence of metrics, ablations, or baselines, however, leaves the performance claims unverified and reduces the immediate contribution to a preliminary demonstration rather than a substantiated advance.
major comments (3)
- [Abstract, §4] Abstract and §4 (Experiments): the assertion that the system 'excels' in spatiotemporal reasoning, event localization, and causal relationship inference rests exclusively on preliminary qualitative examples. No quantitative metrics (e.g., accuracy, mAP, or causal QA scores), baselines (e.g., existing video-LLMs), or error analysis are reported, rendering the central performance claim unsupported.
- [§3] §3 (Method): the learnable neural interface is presented as the key bridge between video foundation models and the LLM, yet no ablation isolating its contribution (e.g., frozen vs. trained interface) or analysis of how it encodes temporal or causal structure is provided. This leaves the mechanism by which the claimed capabilities arise unexamined.
- [§2] §2 (Dataset): while the video-centric instruction dataset is described as capturing causal relationships, no statistics on the distribution of causal vs. descriptive queries, inter-annotator agreement, or verification that the data actually elicits the targeted reasoning are given. This weakens the justification for using the dataset to train the claimed capabilities.
minor comments (2)
- [Abstract] The GitHub link is provided but the manuscript does not specify which components (interface weights, dataset splits, evaluation scripts) are released, making reproducibility harder to assess.
- [§3] Notation for the neural interface (e.g., input/output dimensions, training objective) is introduced without an accompanying equation or diagram in §3, which would clarify the architecture.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We agree that the current version of the manuscript is preliminary in nature and that the performance claims would benefit from quantitative support, ablations, and dataset statistics. We will revise the paper accordingly to address these points while preserving its positioning as an initial prototype for chat-centric video understanding. Our point-by-point responses follow.
Point-by-point responses
-
Referee: [Abstract, §4] Abstract and §4 (Experiments): the assertion that the system 'excels' in spatiotemporal reasoning, event localization, and causal relationship inference rests exclusively on preliminary qualitative examples. No quantitative metrics (e.g., accuracy, mAP, or causal QA scores), baselines (e.g., existing video-LLMs), or error analysis are reported, rendering the central performance claim unsupported.
Authors: We agree that the central claims rest on qualitative examples alone and that this leaves the performance assertions unsupported by quantitative evidence. In the revision we will replace the word 'excels' with more measured language that reflects the preliminary nature of the work. We will also add quantitative results on standard video QA and temporal grounding benchmarks, include direct comparisons to existing video-LLM baselines, and provide a basic error analysis of failure cases. revision: yes
-
Referee: [§3] §3 (Method): the learnable neural interface is presented as the key bridge between video foundation models and the LLM, yet no ablation isolating its contribution (e.g., frozen vs. trained interface) or analysis of how it encodes temporal or causal structure is provided. This leaves the mechanism by which the claimed capabilities arise unexamined.
Authors: We concur that the absence of ablations and mechanistic analysis leaves the role of the learnable interface insufficiently examined. In the revised manuscript we will report ablations that compare the trained interface against a frozen counterpart and will include visualizations (e.g., attention maps over video tokens) together with probing experiments to illustrate how temporal and causal information is represented. revision: yes
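Read as code, the promised ablation reduces to a parameter-freezing toggle; the sketch below assumes attribute names (video_encoder, interface) that are not from the released code.

```python
# Hedged sketch of the frozen-vs-trained interface ablation; `model.video_encoder`
# and `model.interface` are assumed attribute names, not the released code's API.
def configure_ablation_arm(model, train_interface: bool):
    """Freeze the video foundation model; toggle gradients for the interface only."""
    for p in model.video_encoder.parameters():
        p.requires_grad = False            # frozen in both ablation arms
    for p in model.interface.parameters():
        p.requires_grad = train_interface  # False = frozen interface, True = trained

# Both arms then share the same instruction-tuning schedule and evaluation set,
# so any score difference is attributable to training the interface.
```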
-
Referee: [§2] §2 (Dataset): while the video-centric instruction dataset is described as capturing causal relationships, no statistics on the distribution of causal vs. descriptive queries, inter-annotator agreement, or verification that the data actually elicits the targeted reasoning are given. This weakens the justification for using the dataset to train the claimed capabilities.
Authors: We acknowledge that additional dataset documentation is required. The revised version will include quantitative statistics on the proportion of causal, spatiotemporal, and descriptive queries, inter-annotator agreement figures for the collected conversations, and a description of the annotation protocol and quality-control steps used to ensure the data targets the intended reasoning skills. revision: yes
Circularity Check
No significant circularity: empirical system assembly with external validation
full rationale
The paper describes an engineering integration of pre-existing video foundation models and LLMs via a new learnable neural interface, trained on a custom video-centric instruction dataset. Claims of performance in spatiotemporal reasoning etc. rest on preliminary qualitative experiments rather than any derivation, equation, or prediction. No self-definitional steps, fitted inputs renamed as predictions, or load-bearing self-citations appear in the provided text. The central result is an assembled prototype whose effectiveness is asserted via external demonstration, not closed-loop reduction to its own inputs.
Axiom & Free-Parameter Ledger
free parameters (1)
- learnable neural interface parameters
axioms (1)
- domain assumption: Pre-trained video foundation models extract features sufficient for spatiotemporal and causal reasoning when aligned via the interface.
Forward citations
Cited by 24 Pith papers
-
EvoGround: Self-Evolving Video Agents for Video Temporal Grounding
A proposer-solver agent pair achieves supervised-level video temporal grounding and fine-grained captioning from 2.5K unlabeled videos via self-reinforcing evolution.
-
OZ-TAL: Online Zero-Shot Temporal Action Localization
Defines OZ-TAL task and presents a training-free VLM-based method that outperforms prior approaches for online and offline zero-shot temporal action localization on THUMOS14 and ActivityNet-1.3.
-
IMPACT-CYCLE: A Contract-Based Multi-Agent System for Claim-Level Supervisory Correction of Long-Video Semantic Memory
A contract-based multi-agent system maintains a claim-level semantic memory for long videos, enabling targeted corrections that raise VQA accuracy from 0.71 to 0.79 and cut human arbitration cost by 4.8x on VidOR.
-
OASIS: On-Demand Hierarchical Event Memory for Streaming Video Reasoning
OASIS organizes streaming video into hierarchical events and retrieves memory on-demand via intent-driven refinement to improve long-horizon accuracy and compositional reasoning with bounded token costs.
-
SVAgent: Storyline-Guided Long Video Understanding via Cross-Modal Multi-Agent Collaboration
SVAgent improves long video question answering by constructing storylines via multi-agent collaboration and aligning cross-modal predictions for more robust, human-like reasoning.
-
LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models
LLaVA-NeXT-Interleave unifies multi-image, video, and 3D capabilities in large multimodal models via a new 1.18M-sample interleaved dataset and benchmark, achieving leading results across those tasks while preserving ...
-
MLVU: Benchmarking Multi-task Long Video Understanding
MLVU is a new benchmark for long video understanding that uses extended videos across diverse genres and multi-task evaluations, revealing that current MLLMs struggle significantly and degrade sharply with longer durations.
-
SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension
SEED-Bench is a new benchmark of 19K multiple-choice questions for evaluating generative comprehension in multimodal LLMs across 12 image and video dimensions.
-
Beyond Perceptual Shortcuts: Causal-Inspired Debiasing Optimization for Generalizable Video Reasoning in Lightweight MLLMs
VideoThinker improves lightweight MLLM video reasoning by creating a bias model to capture shortcuts and applying causal debiasing policy optimization to push away from them, achieving SOTA efficiency with minimal data.
-
Seeing Fast and Slow: Learning the Flow of Time in Videos
Self-supervised models learn to perceive and manipulate the flow of time in videos, supporting speed detection, large-scale slow-motion data curation, and temporally controllable video synthesis.
-
One Token per Highly Selective Frame: Towards Extreme Compression for Long Video Understanding
XComp reaches extreme video compression (one token per selective frame) via learnable progressive token compression and question-conditioned frame selection, lifting LVBench accuracy from 42.9 percent to 46.2 percent ...
-
ViLL-E: Video LLM Embeddings for Retrieval
ViLL-E introduces a dynamic embedding mechanism and joint contrastive-generative training for VideoLLMs, delivering up to 7% gains in temporal localization and 4% in video retrieval while enabling new zero-shot capabilities.
-
POINTS-Long: Adaptive Dual-Mode Visual Reasoning in MLLMs
POINTS-Long is a dual-mode multimodal large language model that uses dynamic visual token scaling to retain 97.7-99.7% accuracy on long-form tasks with 1/40 to 1/10th the tokens and supports streaming via detachable KV-cache.
-
InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency
InternVL3.5 advances open-source multimodal models with Cascade RL for +16% reasoning gains and ViR for 4x inference speedup, with the 241B model reaching SOTA among open-source MLLMs on multimodal, reasoning, and age...
-
InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models
InternVL3-78B sets a new open-source SOTA of 72.2 on MMMU via native joint multimodal pre-training, V2PE, MPO, and test-time scaling while remaining competitive with proprietary models.
-
Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling
InternVL 2.5 is the first open-source MLLM to surpass 70% on the MMMU benchmark via model, data, and test-time scaling, with a 3.7-point gain from chain-of-thought reasoning.
-
LLaVA-Video: Video Instruction Tuning With Synthetic Data
LLaVA-Video-178K is a new synthetic video instruction dataset that, when combined with existing data to train LLaVA-Video, produces strong results on video understanding benchmarks.
-
Video-LLaVA: Learning United Visual Representation by Alignment Before Projection
Video-LLaVA creates a unified visual representation for images and videos via pre-projection alignment, enabling mutual enhancement from joint training and strong results on image and video benchmarks.
-
From Where Things Are to What They Are For: Benchmarking Spatial-Functional Intelligence in Multimodal LLMs
SFI-Bench shows current multimodal LLMs struggle to integrate spatial memory with functional reasoning and external knowledge in video tasks.
-
InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks
InternVL scales a vision model to 6B parameters and aligns it with LLMs using web data to achieve state-of-the-art results on 32 visual-linguistic benchmarks.
-
CurEvo: Curriculum-Guided Self-Evolution for Video Understanding
CurEvo integrates curriculum guidance into self-evolution to structure autonomous improvement of video understanding models, yielding gains on VideoQA benchmarks.
-
Empowering Video Translation using Multimodal Large Language Models
The paper offers the first focused review of MLLM-based video translation organized by a three-role taxonomy of Semantic Reasoner, Expressive Performer, and Visual Synthesizer, plus open challenges.
-
VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding
VideoLLaMA3 uses a vision-centric training paradigm and token-reduction design to reach competitive results on image and video benchmarks.
-
How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites
InternVL 1.5 narrows the performance gap to proprietary multimodal models via a stronger transferable vision encoder, dynamic high-resolution tiling, and curated English-Chinese training data.
Reference graph
Works this paper leans on
-
[1]
Anas Awadalla, Irena Gao, Joshua Gardner, Jack Hessel, Yusuf Hanafy, Wanrong Zhu, Kalyani Marathe, Yonatan Bitton, Samir Gadre, Jenia Jitsev, Simon Kornblith, Pang Wei Koh, Gabriel Ilharco, Mitchell Wortsman, and Ludwig Schmidt. Openflamingo, March 2023
work page 2023
-
[2]
Frozen in time: A joint video and image encoder for end-to-end retrieval
Max Bain, Arsha Nagrani, Gül Varol, and Andrew Zisserman. Frozen in time: A joint video and image encoder for end-to-end retrieval. In IEEE International Conference on Computer Vision, 2021
work page 2021
-
[3]
Language models are few-shot learners
Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in Neural Information Processing Systems , 33:1877–1901, 2020
work page 2020
-
[4]
Conceptual 12m: Pushing web-scale image-text pre-training to recognize long-tail visual concepts
Soravit Changpinyo, Piyush Sharma, Nan Ding, and Radu Soricut. Conceptual 12m: Pushing web-scale image-text pre-training to recognize long-tail visual concepts. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3558–3568, 2021
work page 2021
-
[5]
Internvideo-ego4d: A pack of champion solutions to ego4d challenges
Guo Chen, Sen Xing, Zhe Chen, Yi Wang, Kunchang Li, Yizhuo Li, Yi Liu, Jiahao Wang, Yin-Dong Zheng, Bingkun Huang, et al. Internvideo-ego4d: A pack of champion solutions to ego4d challenges. arXiv preprint arXiv:2211.09529, 2022
-
[6]
Microsoft COCO Captions: Data Collection and Evaluation Server
Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325, 2015
work page 2015
-
[7]
Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality, 2023
Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E Gonzalez, et al. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality, 2023
work page 2023
-
[8]
Scaling Instruction-Finetuned Language Models
Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Eric Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, et al. Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416, 2022
work page 2022
-
[9]
MOSS contributors. Moss. https://github.com/OpenLMLab/MOSS, 2023
work page 2023
-
[10]
Stablelm: Stability ai language models
StableLM contributors. Stablelm: Stability ai language models. https://github.com/stability-AI/ stableLM, 2023
work page 2023
-
[11]
An empirical study of training end-to-end vision-and- language transformers
Zi-Yi Dou, Yichong Xu, Zhe Gan, Jianfeng Wang, Shuohang Wang, Lijuan Wang, Chenguang Zhu, Pengchuan Zhang, Lu Yuan, Nanyun Peng, et al. An empirical study of training end-to-end vision-and- language transformers. In CVPR, 2022
work page 2022
-
[12]
Violet: End- to-end video-language transformers with masked visual-token modeling
Tsu-Jui Fu, Linjie Li, Zhe Gan, Kevin Lin, William Yang Wang, Lijuan Wang, and Zicheng Liu. Violet: End- to-end video-language transformers with masked visual-token modeling. arXiv preprint arXiv:2111.12681, 2021
-
[13]
Scaling up vision-language pre-training for image captioning
Xiaowei Hu, Zhe Gan, Jianfeng Wang, Zhengyuan Yang, Zicheng Liu, Yumao Lu, and Lijuan Wang. Scaling up vision-language pre-training for image captioning. In CVPR, 2022
work page 2022
-
[14]
Language is not all you need: Aligning perception with language models
Shaohan Huang, Li Dong, Wenhui Wang, Yaru Hao, Saksham Singhal, Shuming Ma, Tengchao Lv, Lei Cui, Owais Khan Mohammed, Qiang Liu, et al. Language is not all you need: Aligning perception with language models. arXiv preprint arXiv:2302.14045, 2023
-
[15]
Tag2text: Guiding vision-language model via image tagging
Xinyu Huang, Youcai Zhang, Jinyu Ma, Weiwei Tian, Rui Feng, Yuejie Zhang, Yaqian Li, Yandong Guo, and Lei Zhang. Tag2text: Guiding vision-language model via image tagging. arXiv preprint arXiv:2303.05657, 2023
-
[16]
Dolphin: General video interaction platform based on llms, 2023
Zehuan Huang, Haoran Feng, Chongzhi Zhang, Lu Sheng, Ziwei Liu, and Jing Shao. Dolphin: General video interaction platform based on llms, 2023. https://github.com/kaleido-lab/dolphin
work page 2023
-
[17]
Visual genome: Connecting language and vision using crowdsourced dense image annotations
Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A Shamma, et al. Visual genome: Connecting language and vision using crowdsourced dense image annotations. International journal of computer vision , 123:32–73, 2017
work page 2017
-
[18]
Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597, 2023
work page 2023
-
[19]
Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In International Conference on Machine Learning, pages 12888–12900. PMLR, 2022
work page 2022
-
[20]
Uniformerv2: Spatiotemporal learning by arming image vits with video uniformer
Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Limin Wang, and Yu Qiao. Uniformerv2: Spatiotemporal learning by arming image vits with video uniformer. arXiv preprint arXiv:2211.09552, 2022
-
[21]
Unmasked teacher: Towards training-efficient video foundation models
Kunchang Li, Yali Wang, Yizhuo Li, Yi Wang, Yinan He, Limin Wang, and Yu Qiao. Unmasked teacher: Towards training-efficient video foundation models. arXiv preprint arXiv:2303.16058, 2023
-
[22]
Lavender: Unifying video-language understanding as masked language modeling
Linjie Li, Zhe Gan, Kevin Lin, Chung-Ching Lin, Zicheng Liu, Ce Liu, and Lijuan Wang. Lavender: Unifying video-language understanding as masked language modeling. arXiv preprint arXiv:2206.07160, 2022
-
[23]
Learning spatiotemporal features via video and text pair discrimination
Tianhao Li and Limin Wang. Learning spatiotemporal features via video and text pair discrimination. CoRR, abs/2001.05691, 2020
-
[24]
Yaobo Liang, Chenfei Wu, Ting Song, Wenshan Wu, Yan Xia, Yu Liu, Yang Ou, Shuai Lu, Lei Ji, Shaoguang Mao, et al. Taskmatrix. ai: Completing tasks by connecting foundation models with millions of apis. arXiv preprint arXiv:2303.16434, 2023
-
[25]
Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. 2023
work page 2023
-
[26]
Internchat: Solving vision-centric tasks by interacting with chatbots beyond language
Zhaoyang Liu, Yinan He, Wenhai Wang, Weiyun Wang, Yi Wang, Shoufa Chen, Qinglong Zhang, Yang Yang, Qingyun Li, Jiashuo Yu, Kunchang Li, Zhe Chen, Xue Yang, Xizhou Zhu, Yali Wang, Limin Wang, Ping Luo, Jifeng Dai, and Yu Qiao. Internchat: Solving vision-centric tasks by interacting with chatbots beyond language. https://arxiv.org/abs/2305.05662, 2023
-
[27]
End-to-end learning of visual representations from uncurated instructional videos
Antoine Miech, Jean-Baptiste Alayrac, Lucas Smaira, Ivan Laptev, Josef Sivic, and Andrew Zisserman. End-to-end learning of visual representations from uncurated instructional videos. In CVPR, 2020
work page 2020
-
[28]
Cross-task generalization via natural language crowdsourcing instructions
Swaroop Mishra, Daniel Khashabi, Chitta Baral, and Hannaneh Hajishirzi. Cross-task generalization via natural language crowdsourcing instructions. arXiv preprint arXiv:2104.08773, 2021
-
[29]
Nikita Nangia, Clara Vania, Rasika Bhalerao, and Samuel R
-
[30]
Chatgpt: Optimizing language models for dialogue
TB OpenAI. Chatgpt: Optimizing language models for dialogue. OpenAI, 2022
work page 2022
-
[31]
Im2text: Describing images using 1 million captioned photographs
Vicente Ordonez, Girish Kulkarni, and Tamara Berg. Im2text: Describing images using 1 million captioned photographs. Advances in neural information processing systems , 24, 2011
work page 2011
-
[32]
Training language models to follow instructions with human feedback
Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems , 35:27730–27744, 2022
work page 2022
-
[33]
Robust Speech Recognition via Large-Scale Weak Supervision
Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. Robust speech recognition via large-scale weak supervision. arXiv preprint arXiv:2212.04356, 2022
work page 2022
-
[34]
Exploring the limits of transfer learning with a unified text-to-text transformer
Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research , 21(1):5485–5551, 2020
work page 2020
-
[35]
Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning
Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Proceedings of ACL, 2018
work page 2018
-
[36]
How much can clip benefit vision-and-language tasks?
Sheng Shen, Liunian Harold Li, Hao Tan, Mohit Bansal, Anna Rohrbach, Kai-Wei Chang, Zhewei Yao, and Kurt Keutzer. How much can clip benefit vision-and-language tasks? arXiv preprint arXiv:2107.06383, 2021
-
[37]
HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face
Yongliang Shen, Kaitao Song, Xu Tan, Dongsheng Li, Weiming Lu, and Yueting Zhuang. Hugginggpt: Solving ai tasks with chatgpt and its friends in huggingface. arXiv preprint arXiv:2303.17580, 2023
work page 2023
-
[38]
Chen Sun, Austin Myers, Carl Vondrick, Kevin P. Murphy, and Cordelia Schmid. Videobert: A joint model for video and language representation learning. ICCV, 2019
work page 2019
-
[39]
EVA-CLIP: Improved Training Techniques for CLIP at Scale
Quan Sun, Yuxin Fang, Ledell Wu, Xinlong Wang, and Yue Cao. Eva-clip: Improved training techniques for clip at scale. arXiv preprint arXiv:2303.15389, 2023
work page 2023
- [40]
-
[41]
Videomae: Masked autoencoders are data-efficient learners for self-supervised video pre-training
Zhan Tong, Yibing Song, Jue Wang, and Limin Wang. Videomae: Masked autoencoders are data-efficient learners for self-supervised video pre-training. In Advances in Neural Information Processing Systems , 2022
work page 2022
-
[42]
LLaMA: Open and Efficient Foundation Language Models
Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023
work page 2023
-
[43]
All in one: Exploring unified video-language pre-training
Alex Jinpeng Wang, Yixiao Ge, Rui Yan, Yuying Ge, Xudong Lin, Guanyu Cai, Jianping Wu, Ying Shan, Xiaohu Qie, and Mike Zheng Shou. All in one: Exploring unified video-language pre-training. arXiv preprint arXiv:2203.07303, 2022
-
[44]
Videomae v2: Scaling video masked autoencoders with dual masking
Limin Wang, Bingkun Huang, Zhiyu Zhao, Zhan Tong, Yinan He, Yi Wang, Yali Wang, and Yu Qiao. Videomae v2: Scaling video masked autoencoders with dual masking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023
work page 2023
-
[45]
Internimage: Exploring large-scale vision foundation models with deformable convolutions
Wenhai Wang, Jifeng Dai, Zhe Chen, Zhenhang Huang, Zhiqi Li, Xizhou Zhu, Xiaowei Hu, Tong Lu, Lewei Lu, Hongsheng Li, et al. Internimage: Exploring large-scale vision foundation models with deformable convolutions. In Proceedings of the IEEE/CVF International Conference on Computer Vision , 2023
work page 2023
-
[46]
Internvideo: General video foundation models via generative and discriminative learning
Yi Wang, Kunchang Li, Yizhuo Li, Yinan He, Bingkun Huang, Zhiyu Zhao, Hongjie Zhang, Jilan Xu, Yi Liu, Zun Wang, Sen Xing, Guo Chen, Junting Pan, Jiashuo Yu, Yali Wang, Limin Wang, and Yu Qiao. Internvideo: General video foundation models via generative and discriminative learning. arXiv preprint arXiv:2212.03191, 2022
-
[47]
Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models
Chenfei Wu, Shengming Yin, Weizhen Qi, Xiaodong Wang, Zecheng Tang, and Nan Duan. Visual chatgpt: Talking, drawing and editing with visual foundation models. arXiv preprint arXiv:2303.04671, 2023
work page 2023
-
[48]
Grit: A generative region-to-text transformer for object understanding
Jialian Wu, Jianfeng Wang, Zhengyuan Yang, Zhe Gan, Zicheng Liu, Junsong Yuan, and Lijuan Wang. Grit: A generative region-to-text transformer for object understanding. arXiv preprint arXiv:2212.00280, 2022
-
[49]
Videoclip: Contrastive pre-training for zero-shot video-text understanding
Hu Xu, Gargi Ghosh, Po-Yao Huang, Dmytro Okhonko, Armen Aghajanyan, Florian Metze, Luke Zettlemoyer, and Christoph Feichtenhofer. Videoclip: Contrastive pre-training for zero-shot video-text understanding. arXiv preprint arXiv:2109.14084, 2021
-
[50]
MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action
Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Ehsan Azarnasab, Faisal Ahmed, Zicheng Liu, Ce Liu, Michael Zeng, and Lijuan Wang. Mm-react: Prompting chatgpt for multimodal reasoning and action. ArXiv, abs/2303.11381, 2023
work page 2023
-
[51]
Filip: Fine-grained interactive language-image pre-training
Lewei Yao, Runhui Huang, Lu Hou, Guansong Lu, Minzhe Niu, Hang Xu, Xiaodan Liang, Zhenguo Li, Xin Jiang, and Chunjing Xu. Filip: Fine-grained interactive language-image pre-training. arXiv preprint arXiv:2111.07783, 2021
-
[52]
mplug-owl: Modularization empowers large language models with multimodality, 2023
Qinghao Ye, Haiyang Xu, Guohai Xu, Jiabo Ye, Ming Yan, Yiyang Zhou, Junyang Wang, Anwen Hu, Pengcheng Shi, Yaya Shi, Chaoya Jiang, Chenliang Li, Yuanhong Xu, Hehong Chen, Junfeng Tian, Qian Qi, Ji Zhang, and Fei Huang. mplug-owl: Modularization empowers large language models with multimodality, 2023
work page 2023
-
[53]
Florence: A new foundation model for computer vision
Lu Yuan, Dongdong Chen, Yi-Ling Chen, Noel Codella, Xiyang Dai, Jianfeng Gao, Houdong Hu, Xuedong Huang, Boxin Li, Chunyuan Li, et al. Florence: A new foundation model for computer vision. arXiv preprint arXiv:2111.11432, 2021
-
[54]
Merlot reserve: Neural script knowledge through vision and language and sound
Rowan Zellers, Jiasen Lu, Ximing Lu, Youngjae Yu, Yanpeng Zhao, Mohammadreza Salehi, Aditya Kusupati, Jack Hessel, Ali Farhadi, and Yejin Choi. Merlot reserve: Neural script knowledge through vision and language and sound. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16375–16387, 2022
work page 2022
-
[55]
Merlot: Multimodal neural script knowledge models
Rowan Zellers, Ximing Lu, Jack Hessel, Youngjae Yu, Jae Sung Park, Jize Cao, Ali Farhadi, and Yejin Choi. Merlot: Multimodal neural script knowledge models. NeurIPS, 2021
work page 2021
-
[56]
GLM-130B: An Open Bilingual Pre-trained Model
Aohan Zeng, Xiao Liu, Zhengxiao Du, Zihan Wang, Hanyu Lai, Ming Ding, Zhuoyi Yang, Yifan Xu, Wendi Zheng, Xiao Xia, et al. Glm-130b: An open bilingual pre-trained model. arXiv preprint arXiv:2210.02414, 2022
work page 2022
-
[57]
OPT: Open Pre-trained Transformer Language Models
Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, et al. Opt: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068, 2022
work page 2022
-
[59]
MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models
Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592, 2023
work page 2023
-
[60]
Actbert: Learning global-local video-text representations
Linchao Zhu and Yi Yang. Actbert: Learning global-local video-text representations. CVPR, 2020
work page 2020