LLaVA-Video: Video Instruction Tuning With Synthetic Data
Pith reviewed 2026-05-10 23:16 UTC · model grok-4.3
The pith
Synthetic video instructions train a model that performs strongly on real benchmarks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
A synthetic dataset called LLaVA-Video-178K is created to supply video-specific instruction examples in three formats: detailed captioning, open-ended question answering, and multiple-choice question answering. When a video model is trained on this dataset together with prior visual instruction data, it records strong results across multiple established video benchmarks.
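The three instruction formats named above can be pictured as a single record type. The sketch below is a hypothetical schema, not the released dataset's actual layout: the class names, field names, and validation rules are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class TaskType(Enum):
    """The three instruction formats named in the core claim."""
    CAPTION = "detailed_captioning"
    OPEN_QA = "open_ended_qa"
    MC_QA = "multiple_choice_qa"


@dataclass
class VideoInstruction:
    """One hypothetical training example (field names are assumptions)."""
    video_id: str
    task: TaskType
    prompt: str
    response: str
    # Answer options; only meaningful for multiple-choice examples.
    options: List[str] = field(default_factory=list)

    def validate(self) -> None:
        """Raise ValueError if the record is inconsistent with its task type."""
        if self.task is TaskType.MC_QA:
            if len(self.options) < 2:
                raise ValueError("multiple-choice examples need at least two options")
            if self.response not in self.options:
                raise ValueError("the response must be one of the options")
        elif self.options:
            raise ValueError("only multiple-choice examples carry options")


# Illustrative record with made-up content.
example = VideoInstruction(
    video_id="demo_0001",
    task=TaskType.MC_QA,
    prompt="What does the person do first?",
    response="Opens the door",
    options=["Opens the door", "Closes the window"],
)
example.validate()
```

Keeping one record type with a task tag, rather than three separate types, is only one possible design; it makes it simple to mix the formats in a single training stream.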
What carries the argument
The synthetic dataset generation pipeline that produces the 178K video instructions for captioning and question-answering tasks.
If this is right
- Video models become easier to scale because training no longer depends on massive new crawls of real footage.
- The same synthetic approach can be used to add more examples for specific video tasks without additional manual labeling.
- Releasing both the dataset and the generation code lets other groups reproduce and extend the training process.
Where Pith is reading between the lines
- The method could be adapted to generate training examples for models that handle longer or more complex video sequences.
- Synthetic instructions might reduce reliance on web-sourced data that often carries cultural or platform-specific biases.
- Direct comparison of the synthetic model against one trained only on real data of equal size would quantify how much quality is lost or gained.
Load-bearing premise
The generated instructions contain enough realistic detail and variety that models trained on them will generalize to unseen real-world videos rather than only succeeding on the chosen test sets.
What would settle it
Running the trained model on a fresh collection of real videos drawn from sources outside the generation process and observing a large drop in accuracy on the same task types would show the synthetic data did not produce genuine generalization.
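The test described above amounts to comparing accuracy on the established benchmarks with accuracy on a fresh, out-of-pipeline video set. A minimal sketch of that comparison follows, assuming multiple-choice answers scored by exact match; the function names and the toy labels are hypothetical and are not results from the paper.

```python
from typing import Sequence


def accuracy(predictions: Sequence[str], answers: Sequence[str]) -> float:
    """Fraction of predictions that exactly match the reference answers."""
    if len(predictions) != len(answers) or not answers:
        raise ValueError("predictions and answers must be non-empty and equal in length")
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)


def generalization_gap(
    bench_preds: Sequence[str],
    bench_answers: Sequence[str],
    fresh_preds: Sequence[str],
    fresh_answers: Sequence[str],
) -> float:
    """Benchmark accuracy minus fresh-set accuracy; a large positive value
    would point toward benchmark-specific success rather than generalization."""
    return accuracy(bench_preds, bench_answers) - accuracy(fresh_preds, fresh_answers)


# Toy labels for illustration only (not real model outputs).
gap = generalization_gap(
    bench_preds=["A", "B", "C", "D"],
    bench_answers=["A", "B", "C", "A"],
    fresh_preds=["A", "B", "C", "D"],
    fresh_answers=["A", "B", "D", "A"],
)
print(f"accuracy gap: {gap:.2f}")
```

How large a gap counts as "large" is a judgment call that this sketch does not settle; in practice it would be compared against run-to-run variance.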
Original abstract
The development of video large multimodal models (LMMs) has been hindered by the difficulty of curating large amounts of high-quality raw data from the web. To address this, we propose an alternative approach by creating a high-quality synthetic dataset specifically for video instruction-following, namely LLaVA-Video-178K. This dataset includes key tasks such as detailed captioning, open-ended question-answering (QA), and multiple-choice QA. By training on this dataset, in combination with existing visual instruction tuning data, we introduce LLaVA-Video, a new video LMM. Our experiments demonstrate that LLaVA-Video achieves strong performance across various video benchmarks, highlighting the effectiveness of our dataset. We plan to release the dataset, its generation pipeline, and the model checkpoints.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces LLaVA-Video-178K, a synthetic dataset of 178K video instruction-following examples covering detailed captioning, open-ended QA, and multiple-choice QA tasks. By training on this dataset in combination with existing visual instruction tuning data, the authors create LLaVA-Video, a video large multimodal model, and report strong performance across various video benchmarks. The authors plan to release the dataset, generation pipeline, and model checkpoints.
Significance. If the synthetic data is shown to drive measurable gains, the approach could meaningfully address data scarcity challenges in video LMM development by offering a scalable alternative to web curation. The explicit commitment to releasing the full dataset, pipeline, and checkpoints is a clear strength for reproducibility and community follow-up work.
Major comments (2)
- [Experimental Results] The central claim that LLaVA-Video-178K is effective rests on benchmark gains, yet the experimental section contains no ablation that trains an identical model on existing visual instruction data alone and measures the delta on the same video benchmarks. Without this control, performance cannot be attributed to the synthetic data rather than the base model or prior data.
- [§3] §3 (Data Generation): The pipeline description provides no quantitative metrics or controls for data quality, diversity, or hallucination rates in the generated instructions. This is load-bearing because the assumption that the synthetic data supports genuine generalization (rather than benchmark overfitting) is not tested.
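The control requested in the first major comment reduces to a per-benchmark difference between two training runs that differ only in whether the synthetic dataset is included. The sketch below assumes each run is summarized as a mapping from benchmark name to score; the benchmark names and numbers are placeholders, not values reported in the paper.

```python
from typing import Dict


def benchmark_deltas(
    baseline: Dict[str, float],
    with_synthetic: Dict[str, float],
) -> Dict[str, float]:
    """Score change per benchmark when the synthetic data is added.

    `baseline` is the run trained on prior visual instruction data alone;
    `with_synthetic` is the same setup plus the synthetic video data.
    """
    mismatched = set(baseline) ^ set(with_synthetic)
    if mismatched:
        raise ValueError(f"benchmarks missing from one run: {sorted(mismatched)}")
    return {name: with_synthetic[name] - baseline[name] for name in sorted(baseline)}


# Placeholder scores for illustration only.
baseline_scores = {"bench_a": 50.0, "bench_b": 40.25}
full_scores = {"bench_a": 55.5, "bench_b": 40.0}

for name, delta in benchmark_deltas(baseline_scores, full_scores).items():
    print(f"{name}: {delta:+.2f}")
```

Refusing to compare runs with mismatched benchmark sets is a deliberate choice here, since a silently dropped benchmark would make the attribution look cleaner than it is.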
Minor comments (1)
- [Abstract] The abstract asserts 'strong performance' without any numerical results, baselines, or table references; including one or two key metrics would make the claim immediately verifiable.
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive feedback. We agree that additional controls would strengthen the attribution of gains to the synthetic dataset and provide better evidence of data quality. We address each major comment below and will revise the manuscript accordingly.
Point-by-point responses
- Referee: [Experimental Results] The central claim that LLaVA-Video-178K is effective rests on benchmark gains, yet the experimental section contains no ablation that trains an identical model on existing visual instruction data alone and measures the delta on the same video benchmarks. Without this control, performance cannot be attributed to the synthetic data rather than the base model or prior data.
  Authors: We agree that a direct ablation isolating the contribution of LLaVA-Video-178K is important for attributing performance gains. In the revised manuscript, we will add results from training the identical base model and training setup on the existing visual instruction tuning data alone (without LLaVA-Video-178K) and report the performance deltas on the same video benchmarks. This will allow clearer attribution of improvements to the synthetic data.
  Revision: yes
- Referee: [§3] §3 (Data Generation): The pipeline description provides no quantitative metrics or controls for data quality, diversity, or hallucination rates in the generated instructions. This is load-bearing because the assumption that the synthetic data supports genuine generalization (rather than benchmark overfitting) is not tested.
  Authors: We acknowledge that quantitative metrics for data quality would strengthen the paper. In the revised §3, we will include quantitative controls such as diversity statistics (e.g., vocabulary size, n-gram uniqueness across samples, and semantic embedding variance) and hallucination estimates obtained via a combination of automated consistency checks and manual sampling. We will also discuss how these metrics support the claim of genuine generalization rather than overfitting.
  Revision: yes
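Two of the diversity statistics named in the rebuttal, vocabulary size and n-gram uniqueness, are simple to compute over a list of generated instructions. The sketch below is one possible reading of those terms, assuming lowercase whitespace tokenization and the common "distinct-n" ratio (unique n-grams divided by total n-grams); the function names are hypothetical, and semantic embedding variance and hallucination estimates are not covered.

```python
from typing import Iterable, List, Tuple


def tokenize(text: str) -> List[str]:
    """Lowercase whitespace tokenization (an assumption; real pipelines may differ)."""
    return text.lower().split()


def vocabulary_size(samples: Iterable[str]) -> int:
    """Number of distinct tokens across all samples."""
    return len({token for sample in samples for token in tokenize(sample)})


def distinct_n(samples: Iterable[str], n: int) -> float:
    """Unique n-grams divided by total n-grams, pooled across samples.

    Values near 1.0 suggest little repetition; values near 0.0 suggest
    heavily templated text. Returns 0.0 when no n-grams exist.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    ngrams: List[Tuple[str, ...]] = []
    for sample in samples:
        tokens = tokenize(sample)
        ngrams.extend(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    if not ngrams:
        return 0.0
    return len(set(ngrams)) / len(ngrams)


# Toy instructions for illustration only.
toy_samples = ["a man opens a door", "a man closes a door"]
print(vocabulary_size(toy_samples))  # 5 distinct tokens
print(distinct_n(toy_samples, 2))    # 6 unique bigrams out of 8 -> 0.75
```

Pooling n-grams across samples, rather than averaging a per-sample ratio, is what makes the statistic sensitive to templates that repeat from one instruction to the next.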
Circularity Check
No circularity: empirical training and external benchmark evaluation
The paper describes an empirical pipeline: synthetic data generation for video instruction tuning, combined training with existing data, and evaluation on standard external video benchmarks. No derivation chain, equations, or first-principles predictions are present that could reduce to inputs by construction. Claims of effectiveness rest on measured benchmark performance rather than self-definitional fits, renamed results, or load-bearing self-citations. This is a standard non-circular empirical ML paper.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Synthetic video instruction data generated via the described pipeline is sufficiently high-quality and representative to improve model performance on real benchmarks.
Forward citations
Cited by 60 Pith papers
- Flame3D: Zero-shot Compositional Reasoning of 3D Scenes with Agentic Language Models
  Flame3D enables zero-shot compositional 3D scene reasoning by representing scenes as editable visual-textual memories exposed to agentic MLLMs through composable and synthesizable spatial tools.
- MedHorizon: Towards Long-context Medical Video Understanding in the Wild
  MedHorizon benchmark reveals current multimodal LLMs achieve only 41.1% accuracy on long medical videos due to failures in sparse evidence retrieval and procedural reasoning.
- When Text Hijacks Vision: Benchmarking and Mitigating Text Overlay-Induced Hallucination in Vision Language Models
  VLMs hallucinate by prioritizing contradictory on-screen text over visual content, addressed via the VisualTextTrap benchmark with 6,057 human-validated samples and the VTHM-MoE dual-encoder framework using dimension-...
- RefereeBench: Are Video MLLMs Ready to be Multi-Sport Referees
  RefereeBench shows that even the strongest video MLLMs reach only around 60% accuracy on multi-sport refereeing tasks and struggle with rule application and temporal grounding.
- DRIVESPATIAL: A Benchmark for Spatiotemporal Intelligence in VLMs for Autonomous Driving
  DriveSpatial benchmark shows the best of 15 VLMs trails humans by 28.4 points on spatiotemporal driving tasks, with cognitive scene construction as the main failure mode.
- ParaVT: Taming the Tool Prior Paradox for Parallel Tool Use in Agentic Video Reinforcement Learning
  ParaVT is a parallel video tool-calling RL framework that resolves the Tool Prior Paradox via PARA-GRPO, delivering +7.9% average gains on six long-video benchmarks and raising format compliance from 0.13 to 0.64.
- CaMo: Camera Motion Grounded Evaluation and Training for Vision-Language Models
  Proposes Spatial Narrative Score (SNS) evaluation for VLMs' camera motion understanding and introduces CaMo model achieving consistent performance on SNS and direct QA.
- LMM-Track4D: Eliciting 4D Dynamic Reasoning in LMMs via Trajectory-Grounded Dialogue
  LMM-Track4D formulates a trajectory-grounded dialogue task, releases Track4D-Bench with 526 samples, and proposes RTGE encoding, TRK state token, and OSK-RA decoder to elicit better 4D spatiotemporal reasoning in LMMs.
- Boosting Omni-Modal Language Models: Staged Post-Training with Visually Debiased Evaluation
  Visual debiasing of omni-modal benchmarks combined with staged post-training lets a 3B model match or exceed a 30B model without a stronger teacher.
- StreamPro: From Reactive Perception to Proactive Decision-Making in Streaming Video
  StreamPro introduces a benchmark and training method using CB-Stream Loss and GRPO to enable proactive decision-making in streaming videos, achieving 41.5 on StreamPro-Bench compared to 10.4 previously.
- TOC-Bench: A Temporal Object Consistency Benchmark for Video Large Language Models
  TOC-Bench is an object-track-grounded benchmark that filters for temporally dependent questions and shows Video-LLMs have major weaknesses in event counting, ordering, identity reasoning, and hallucination detection.
- TOC-Bench: A Temporal Object Consistency Benchmark for Video Large Language Models
  TOC-Bench is a new diagnostic benchmark that reveals major weaknesses in temporal object consistency for Video-LLMs, including event counting, ordering, identity reasoning, and hallucination avoidance.
- Semantic-Aware Adaptive Visual Memory for Streaming Video Understanding
  SAVEMem improves streaming video understanding scores by adding semantic awareness to memory compression and query-adaptive retrieval without any model training.
- VideoRouter: Query-Adaptive Dual Routing for Efficient Long-Video Understanding
  VideoRouter uses query-adaptive semantic and image routers plus new training datasets to reduce visual tokens by up to 67.9% while improving performance over the InternVL baseline on long-video benchmarks.
- Retrieving Any Relevant Moments: Benchmark and Models for Generalized Moment Retrieval
  Generalized Moment Retrieval (GMR) is introduced as a unified task with the Soccer-GMR benchmark and adapter models that retrieve multiple or zero matching moments from videos.
- Membership Inference Attacks Against Video Large Language Models
  A temperature-perturbed black-box attack infers video training membership in VideoLLMs with 0.68 AUC by exploiting sharper generation behavior on member samples.
- Don't Pause! Every prediction matters in a streaming video
  SPOT-Bench tests real-time streaming video perception with timeliness metrics, exposing limitations in current models and introducing AsynKV as an improved baseline.
- MuSS: A Large-Scale Dataset and Cinematic Narrative Benchmark for Multi-Shot Subject-to-Video Generation
  MuSS is a new movie-sourced dataset and benchmark that enables AI models to generate multi-shot videos with improved narrative coherence and subject identity preservation.
- MuSS: A Large-Scale Dataset and Cinematic Narrative Benchmark for Multi-Shot Subject-to-Video Generation
  MuSS is a movie-derived dataset and benchmark that enables AI models to generate multi-shot videos with coherent narratives and preserved subject identity across shots.
- CGC: Compositional Grounded Contrast for Fine-Grained Multi-Image Understanding
  CGC improves fine-grained multi-image understanding in MLLMs by constructing contrastive training instances from existing single-image annotations and adding a rule-based spatial reward, achieving SOTA on MIG-Bench an...
- Sink-Token-Aware Pruning for Fine-Grained Video Understanding in Efficient Video LLMs
  Sink-Token-aware Pruning (SToP) suppresses semantically uninformative sink tokens during visual token pruning in Video LLMs, boosting fine-grained performance even at 90% pruning rates across hallucination, reasoning,...
- Dual-Anchoring: Addressing State Drift in Vision-Language Navigation
  Dual-Anchoring Framework mitigates progress drift via structured instruction tokens and memory drift via landmark-centric retrospective prediction, yielding 15.2% success rate gain and 24.7% on long trajectories.
- OASIS: On-Demand Hierarchical Event Memory for Streaming Video Reasoning
  OASIS organizes streaming video into hierarchical events and retrieves memory on-demand via intent-driven refinement to improve long-horizon accuracy and compositional reasoning with bounded token costs.
- Beyond a Single Frame: Multi-Frame Spatially Grounded Reasoning Across Volumetric MRI
  A new multi-frame VQA benchmark on volumetric MRI demonstrates that bounding-box supervised fine-tuning improves spatial grounding in VLMs over zero-shot baselines.
- MCSC-Bench: Multimodal Context-to-Script Creation for Realistic Video Production
  MCSC-Bench is the first large-scale dataset for the Multimodal Context-to-Script Creation task, requiring models to select relevant shots from redundant materials, plan missing shots, and generate coherent scripts wit...
- OmniScript: Towards Audio-Visual Script Generation for Long-Form Cinematic Video
  OmniScript is a new 8B omni-modal model that turns long cinematic videos into scene-by-scene scripts and matches top proprietary models on temporal localization and semantic accuracy.
- SiMing-Bench: Evaluating Procedural Correctness from Continuous Interactions in Clinical Skill Videos
  SiMing-Bench shows current MLLMs have weak agreement with physicians on procedural correctness in clinical videos, with intermediate step judgments remaining poor even when overall scores look acceptable.
- AdaSpark: Adaptive Sparsity for Efficient Long-Video Understanding
  AdaSpark delivers up to 57% FLOP reduction in Video-LLMs for long videos through adaptive cube- and token-level sparsity without apparent loss in performance on hour-scale benchmarks.
- SVAgent: Storyline-Guided Long Video Understanding via Cross-Modal Multi-Agent Collaboration
  SVAgent improves long video question answering by constructing storylines via multi-agent collaboration and aligning cross-modal predictions for more robust, human-like reasoning.
- BoxComm: Benchmarking Category-Aware Commentary Generation and Narration Rhythm in Boxing
  BoxComm is the first large-scale benchmark for category-aware commentary generation and rhythm assessment in boxing, showing state-of-the-art multimodal models struggle with tactical analysis and temporal pacing.
- SCP: Spatial Causal Prediction in Video
  SCP defines a new benchmark task for predicting spatial causal outcomes beyond direct observation and shows that 23 leading models lag far behind humans on it.
- From Verbatim to Gist: Distilling Pyramidal Multimodal Memory via Semantic Information Bottleneck for Long-Horizon Video Agents
  MM-Mem distills video input through a hierarchical memory of sensory buffer, episodic stream, and symbolic schema, optimized by a semantic information bottleneck and SIB-GRPO, to achieve SOTA on long-horizon video benchmarks.
- LongVideo-R1: Smart Navigation for Low-cost Long Video Understanding
  LongVideo-R1 trains a reasoning agent on 33K trajectories to intelligently select informative video clips via iterative refinement and RL, achieving better accuracy-efficiency tradeoffs on long video QA benchmarks.
- Thinking with Geometry: Active Geometry Integration for Spatial Reasoning
  GeoThinker enables active, task-conditioned geometry integration in MLLMs via spatial-grounded fusion and importance gating, reaching 72.6 on VSI-Bench.
- CamReasoner: Reinforcing Camera Movement Understanding via Structured Spatial Reasoning
  CamReasoner uses structured O-T-A reasoning and RL on 56k samples to lift camera movement classification from 73.8% to 78.4% and VQA from 60.9% to 74.5% on Qwen2.5-VL-7B.
- VideoThinker: Building Agentic VideoLLMs with LLM-Guided Tool Reasoning
  VideoThinker uses LLM-generated synthetic tool trajectories in caption space grounded to video frames to train agentic VideoLLMs that outperform baselines on long-video benchmarks.
- 4D-RGPT: Toward Region-level 4D Understanding via Perceptual Distillation
  4D-RGPT uses perceptual 4D distillation to boost region-level 4D perception in multimodal LLMs and reports gains on existing and new video QA benchmarks.
- ProMQA-Assembly: Multimodal Procedural QA Dataset on Assembly
  ProMQA-Assembly is a new multimodal procedural QA dataset with 646 pairs on assembly activities, built via LLM-generated candidates verified by humans plus 81 task graphs, and used to benchmark multimodal models.
- MMSearch-R1: Incentivizing LMMs to Search
  MMSearch-R1 uses reinforcement learning to train multimodal models for on-demand multi-turn internet search with image and text tools, outperforming same-size RAG baselines and matching larger ones while cutting searc...
- SIV-Bench: A Video Benchmark for Social Interaction Understanding and Reasoning
  SIV-Bench is a new video benchmark with 2,792 clips and 5,455 QA pairs that evaluates MLLMs on social scene understanding, state reasoning, and dynamics prediction using social relation theory.
- Video-Holmes: Can MLLM Think Like Holmes for Complex Video Reasoning?
  Video-Holmes benchmark shows top MLLMs achieve at most 45% accuracy on tasks needing integration of multiple clues from suspense films, unlike existing perception-focused tests.
- Unified Reward Model for Multimodal Understanding and Generation
  UnifiedReward is the first unified reward model that jointly assesses multimodal understanding and generation to provide better preference signals for aligning vision models via DPO.
- WorldSense: Evaluating Real-world Omnimodal Understanding for Multimodal LLMs
  WorldSense provides the first benchmark requiring synergistic audio-video-text understanding on 1,662 real-world videos and 3,172 QA pairs, where the best current multimodal LLM reaches only 65.1% accuracy.
- Video-MMMU: Evaluating Knowledge Acquisition from Multi-Discipline Professional Videos
  Video-MMMU benchmark shows large multimodal models exhibit steep performance drops on higher cognitive tasks when learning from professional videos and lag significantly behind humans in knowledge acquisition.
- GA-VLN: Geometry-Aware BEV Representation for Efficient Vision-Language Navigation
  GA-VLN builds a geometry-aware BEV representation from RGB-D inputs plus 3D foundation model features to deliver state-of-the-art vision-language navigation using only navigation data.
- Learning Spatiotemporal Sensitivity in Video LLMs via Counterfactual Reinforcement Learning
  CRPO applies counterfactual videos and a cross-branch relation reward in RL post-training to reduce shortcut reliance in Video LLMs, with gains shown on the new DyBench paired benchmark.
- EvoVid: Temporal-Centric Self-Evolution for Video Large Language Models
  EvoVid proposes a temporal-centric self-evolution framework for Video-LLMs that uses temporal-aware Questioner and temporal-grounded Solver rewards to improve performance directly from unannotated videos.
- Flat-Pack Bench: Evaluating Spatio-Temporal Understanding in Large Vision-Language Models through Furniture Assembly
  Flat-Pack Bench is a new evaluation suite that shows state-of-the-art LVLMs perform poorly on nuanced spatio-temporal reasoning required for furniture assembly videos.
- ParaVT: Taming the Tool Prior Paradox for Parallel Tool Use in Agentic Video Reinforcement Learning
  ParaVT introduces the first multi-agent RL framework for parallel video tool calling in LMMs, using PARA-GRPO to resolve the Tool Prior Paradox and achieve +7.9% average improvement over Qwen3-VL baseline across six b...
- DynaTok: Temporally Adaptive and Positional Bias-Aware Token Compression for Video-LLMs
  DynaTok introduces temporally adaptive budget allocation with EMA memory and spatial selection with memory to compress video tokens, retaining over 95% accuracy at 90% reduction on VideoQA benchmarks.
- See What I Mean: Aligning Vision and Language Representations for Video Fine-grained Object Understanding
  SWIM aligns cross-attention maps from object nouns to ground-truth masks during training on the new NL-Refer dataset to enable text-only fine-grained video object understanding in MLLMs.
- An Efficient Streaming Video Understanding Framework with Agentic Control
  R3-Streaming uses cascaded control, age-aware memory forgetting, and TB-GRPO reinforcement learning to reach SOTA scores on streaming video benchmarks while cutting visual token usage by 95-96%.
- LiteFrame: Efficient Vision Encoders Unlock Frame Scaling in Video LLMs
  LiteFrame is a lightweight video vision encoder trained with Compressed Token Distillation and Language Model Adaptation that achieves 35% lower end-to-end latency while handling 8x more frames and higher accuracy tha...
- SEDualVLN: A Spatially-Enhanced Dual-System for Vision-Language Navigation
  SEDualVLN proposes a spatially-enhanced dual-system VLN framework that pairs a fast VLM action generator with a slow MLLM waypoint planner and reports state-of-the-art results on VLN-CE benchmarks.
- VideoSeeker: Incentivizing Instance-level Video Understanding via Native Agentic Tool Invocation
  VideoSeeker integrates agentic reasoning and visual prompts into LVLMs via automated data synthesis, cold-start supervision, and RL training, yielding +13.7% gains on instance-level video tasks over baselines includin...
- Unlocking Dense Metric Depth Estimation in VLMs
  DepthVLM attaches a depth head to VLMs for native dense metric depth prediction alongside language outputs using a two-stage unified training schedule and a new indoor-outdoor benchmark.
- Unlocking Dense Metric Depth Estimation in VLMs
  DepthVLM converts a standard VLM into a dense metric depth predictor by attaching a lightweight head and training under unified vision-text supervision, outperforming prior VLMs and some pure vision models on a new in...
- Boosting Omni-Modal Language Models: Staged Post-Training with Visually Debiased Evaluation
  Staged post-training with self-distillation lets a 3B omni-modal model match or slightly exceed a 30B model on a visually debiased benchmark.
- OTT-Vid: Optimal Transport Temporal Token Compression for Video Large Language Models
  OTT-Vid uses optimal transport with non-uniform token mass and locality-aware costs to dynamically allocate compression budgets across video frames, retaining 95.8% VQA and 73.9% VTG performance at 10% token retention.
- VideoRouter: Query-Adaptive Dual Routing for Efficient Long-Video Understanding
  VideoRouter uses dual semantic and image routers for query-adaptive token compression in long-video models, delivering up to 67.9% reduction while outperforming the InternVL baseline on VideoMME, MLVU, and LongVideoBench.
Reference graph
Works this paper leans on
-
[1]
Flamingo: a Visual Language Model for Few-Shot Learning
Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katie Millican, Malcolm Reynolds, Roman Ring, Eliza Rutherford, Serkan Cabi, Tengda Han, Zhitao Gong, Sina Samangooei, Marianne Monteiro, Jacob Menick, Sebastian Borgeaud, Andrew Brock, Aida Nematzadeh, Sahand Sharifzadeh, Mikolaj Binkowski,...
work page internal anchor Pith review arXiv 2022
-
[2]
Localizing moments in video with natural language
Lisa Anne Hendricks, Oliver Wang, Eli Shechtman, Josef Sivic, Trevor Darrell, and Bryan Russell. Localizing moments in video with natural language. In Proceedings of the IEEE international conference on computer vision, pp.\ 5803--5812, 2017 a
work page 2017
-
[3]
Localizing moments in video with natural language
Lisa Anne Hendricks, Oliver Wang, Eli Shechtman, Josef Sivic, Trevor Darrell, and Bryan Russell. Localizing moments in video with natural language. In Proceedings of the IEEE international conference on computer vision, pp.\ 5803--5812, 2017 b
work page 2017
-
[4]
Frozen in time: A joint video and image encoder for end-to-end retrieval
Max Bain, Arsha Nagrani, G \"u l Varol, and Andrew Zisserman. Frozen in time: A joint video and image encoder for end-to-end retrieval. In IEEE International Conference on Computer Vision, 2021
work page 2021
-
[5]
Activitynet: A large-scale video benchmark for human activity understanding
Fabian Caba Heilbron, Victor Escorcia, Bernard Ghanem, and Juan Carlos Niebles. Activitynet: A large-scale video benchmark for human activity understanding. In Proceedings of the ieee conference on computer vision and pattern recognition, pp.\ 961--970, 2015
work page 2015
-
[7]
Collecting highly parallel data for paraphrase evaluation
David Chen and William B Dolan. Collecting highly parallel data for paraphrase evaluation. In Proceedings of the 49th annual meeting of the association for computational linguistics: human language technologies, pp.\ 190--200, 2011
work page 2011
-
[10]
VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs
Zesen Cheng, Sicong Leng, Hang Zhang, Yifei Xin, Xin Li, Guanzheng Chen, Yongxin Zhu, Wenqi Zhang, Ziyang Luo, Deli Zhao, and Lidong Bing. Videollama 2: Advancing spatial-temporal modeling and audio understanding in video-llms, 2024. URL https://arxiv.org/abs/2406.07476
work page internal anchor Pith review arXiv 2024
-
[11]
Slowfast networks for video recognition
Christoph Feichtenhofer, Haoqi Fan, Jitendra Malik, and Kaiming He. Slowfast networks for video recognition. In Proceedings of the IEEE/CVF international conference on computer vision, pp.\ 6202--6211, 2019
work page 2019
-
[13]
Raghav Goyal, Samira Ebrahimi Kahou, Vincent Michalski, Joanna Materzynska, Susanne Westphal, Heuna Kim, Valentin Haenel, Ingo Fruend, Peter Yianilos, Moritz Mueller-Freitag, et al. The" something something" video database for learning and evaluating visual common sense. In Proceedings of the IEEE international conference on computer vision, pp.\ 5842--5850, 2017
work page 2017
-
[14]
Ego4d: Around the world in 3,000 hours of egocentric video
Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, et al. Ego4d: Around the world in 3,000 hours of egocentric video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.\ 18995--19012, 2022
work page 2022
-
[15]
Agqa: A benchmark for compositional spatio-temporal reasoning
Madeleine Grunde-McLaughlin, Ranjay Krishna, and Maneesh Agrawala. Agqa: A benchmark for compositional spatio-temporal reasoning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.\ 11287--11297, 2021
work page 2021
-
[21]
Acav100m: Automatic curation of large-scale datasets for audio-visual video representation learning
Sangho Lee, Jiwan Chung, Youngjae Yu, Gunhee Kim, Thomas Breuel, Gal Chechik, and Yale Song. Acav100m: Automatic curation of large-scale datasets for audio-visual video representation learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp.\ 10274--10284, 2021
work page 2021
-
[23]
Less is more: Clipbert for video-and-language learning via sparse sampling
Jie Lei, Linjie Li, Luowei Zhou, Zhe Gan, Tamara L Berg, Mohit Bansal, and Jingjing Liu. Less is more: Clipbert for video-and-language learning via sparse sampling. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp.\ 7331--7341, 2021
work page 2021
-
[25]
Llava-next: What else influences visual instruction tuning beyond data?, May 2024 a
Bo Li, Hao Zhang, Kaichen Zhang, Dong Guo, Yuanhan Zhang, Renrui Zhang, Feng Li, Ziwei Liu, and Chunyuan Li. Llava-next: What else influences visual instruction tuning beyond data?, May 2024 a . URL https://llava-vl.github.io/blog/2024-05-25-llava-next-ablations/
work page 2024
-
[26]
Llava-next: Stronger llms supercharge multimodal capabilities in the wild, May 2024 b
Bo Li, Kaichen Zhang, Hao Zhang, Dong Guo, Renrui Zhang, Feng Li, Yuanhan Zhang, Ziwei Liu, and Chunyuan Li. Llava-next: Stronger llms supercharge multimodal capabilities in the wild, May 2024 b . URL https://llava-vl.github.io/blog/2024-05-10-llava-next-stronger-llms/
work page 2024
-
[28]
Multimodal foundation models: From specialists to general-purpose assistants
Chunyuan Li, Zhe Gan, Zhengyuan Yang, Jianwei Yang, Linjie Li, Lijuan Wang, Jianfeng Gao, et al. Multimodal foundation models: From specialists to general-purpose assistants. Foundations and Trends in Computer Graphics and Vision , 2024 d
work page 2024
-
[29]
Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models, 2023. URL https://arxiv.org/abs/2301.12597
work page internal anchor Pith review arXiv 2023
-
[30]
VideoChat: Chat-Centric Video Understanding
KunChang Li, Yinan He, Yi Wang, Yizhuo Li, Wenhai Wang, Ping Luo, Yali Wang, Limin Wang, and Yu Qiao. Videochat: Chat-centric video understanding, 2024 e . URL https://arxiv.org/abs/2305.06355
work page internal anchor Pith review arXiv 2024
-
[31]
Vila: On pre-training for visual language models
Ji Lin, Hongxu Yin, Wei Ping, Pavlo Molchanov, Mohammad Shoeybi, and Song Han. Vila: On pre-training for visual language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.\ 26689--26699, 2024
work page 2024
-
[32]
Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in neural information processing systems, 36, 2024 a
work page 2024
-
[34]
LMMs-Lab. Video detail caption, 2024. URL https://huggingface.co/datasets/lmms-lab/VideoDetailCaption
work page 2024
-
[36]
Video-chatgpt: Towards detailed video understanding via large vision and language models
Muhammad Maaz, Hanoona Rasheed, Salman Khan, and Fahad Shahbaz Khan. Video-chatgpt: Towards detailed video understanding via large vision and language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL 2024), 2024
work page 2024
-
[37]
Egoschema: A diagnostic benchmark for very long-form video language understanding
Karttikeya Mangalam, Raiymbek Akshulakov, and Jitendra Malik. Egoschema: A diagnostic benchmark for very long-form video language understanding. Advances in Neural Information Processing Systems, 36, 2024
-
[38]
Antoine Miech, Dimitri Zhukov, Jean-Baptiste Alayrac, Makarand Tapaswi, Ivan Laptev, and Josef Sivic. HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips. In ICCV, 2019
-
[39]
OpenAI. Gpt-4v. https://openai.com/index/gpt-4v-system-card/, 2023
-
[41]
Perception test: A diagnostic benchmark for multimodal video models
Viorica Pătrăucean, Lucas Smaira, Ankush Gupta, Adrià Recasens Continente, Larisa Markeeva, Dylan Banarse, Skanda Koppula, Joseph Heyward, Mateusz Malinowski, Yi Yang, Carl Doersch, Tatiana Matejovicova, Yury Sulsky, Antoine Miech, Alex Frechette, Hanna Klimczak, Raphael Koster, Junlin Zhang, Stephanie Winkler, Yusuf Aytar, Simon Osindero, Dima Damen, And...
-
[42]
Learning transferable visual models from natural language supervision
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning (ICML), pp. 8748–8763. PMLR, 2021
-
[43]
Nils Reimers and Iryna Gurevych. Making monolingual sentence embeddings multilingual using knowledge distillation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 11 2020. URL https://arxiv.org/abs/2004.09813
-
[44]
A dataset for movie description
Anna Rohrbach, Marcus Rohrbach, Niket Tandon, and Bernt Schiele. A dataset for movie description. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3202–3212, 2015
-
[45]
Annotating objects and relations in user-generated videos
Xindi Shang, Donglin Di, Junbin Xiao, Yu Cao, Xun Yang, and Tat-Seng Chua. Annotating objects and relations in user-generated videos. In Proceedings of the 2019 on International Conference on Multimedia Retrieval, pp. 279–287. ACM, 2019
-
[46]
Hollywood in homes: Crowdsourcing data collection for activity understanding
Gunnar A Sigurdsson, Gül Varol, Xiaolong Wang, Ali Farhadi, Ivan Laptev, and Abhinav Gupta. Hollywood in homes: Crowdsourcing data collection for activity understanding. In Computer Vision – ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part I 14, pp. 510–526. Springer, 2016
-
[48]
Tarsier: Recipes for training and evaluating large video description models
Jiawei Wang, Liping Yuan, and Yuchen Zhang. Tarsier: Recipes for training and evaluating large video description models, 2024. URL https://arxiv.org/abs/2407.00634
-
[49]
Internvid: A large-scale video-text dataset for multimodal understanding and generation
Yi Wang, Yinan He, Yizhuo Li, Kunchang Li, Jiashuo Yu, Xin Ma, Xinhao Li, Guo Chen, Xinyuan Chen, Yaohui Wang, et al. Internvid: A large-scale video-text dataset for multimodal understanding and generation. In The Twelfth International Conference on Learning Representations, 2023
-
[51]
LongVideoBench: A Benchmark for Long-context Interleaved Video-Language Understanding
Haoning Wu, Dongxu Li, Bei Chen, and Junnan Li. Longvideobench: A benchmark for long-context interleaved video-language understanding, 2024b. URL https://arxiv.org/abs/2407.15754
-
[52]
Next-qa: Next phase of question-answering to explaining temporal actions
Junbin Xiao, Xindi Shang, Angela Yao, and Tat-Seng Chua. Next-qa: Next phase of question-answering to explaining temporal actions. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 9777–9786, 2021
-
[53]
Video question answering via gradually refined attention over appearance and motion
Dejing Xu, Zhou Zhao, Jun Xiao, Fei Wu, Hanwang Zhang, Xiangnan He, and Yueting Zhuang. Video question answering via gradually refined attention over appearance and motion. In ACM Multimedia, 2017
-
[54]
Msr-vtt: A large video description dataset for bridging video and language
Jun Xu, Tao Mei, Ting Yao, and Yong Rui. Msr-vtt: A large video description dataset for bridging video and language. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 5288–5296, 2016
-
[58]
Advancing high-resolution video-language representation with large-scale video transcriptions
Hongwei Xue, Tiankai Hang, Yanhong Zeng, Yuchong Sun, Bei Liu, Huan Yang, Jianlong Fu, and Baining Guo. Advancing high-resolution video-language representation with large-scale video transcriptions. In International Conference on Computer Vision and Pattern Recognition (CVPR), 2022
-
[60]
Activitynet-qa: A dataset for understanding complex web videos via question answering
Zhou Yu, Dejing Xu, Jun Yu, Ting Yu, Zhou Zhao, Yueting Zhuang, and Dacheng Tao. Activitynet-qa: A dataset for understanding complex web videos via question answering. In AAAI, pp. 9127–9134, 2019
-
[61]
Social-iq: A question answering benchmark for artificial social intelligence
Amir Zadeh, Michael Chan, Paul Pu Liang, Edmund Tong, and Louis-Philippe Morency. Social-iq: A question answering benchmark for artificial social intelligence. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8807–8817, 2019
-
[62]
Merlot: Multimodal neural script knowledge models
Rowan Zellers, Ximing Lu, Jack Hessel, Youngjae Yu, Jae Sung Park, Jize Cao, Ali Farhadi, and Yejin Choi. Merlot: Multimodal neural script knowledge models. Advances in neural information processing systems, 34:23634–23651, 2021
-
[63]
Sigmoid loss for language image pre-training
Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 11975–11986, 2023
-
[68]
Direct preference optimization of video large multimodal models from language model reward
Ruohong Zhang, Liangke Gui, Zhiqing Sun, Yihao Feng, Keyang Xu, Yuanhan Zhang, Di Fu, Chunyuan Li, Alexander Hauptmann, Yonatan Bisk, and Yiming Yang. Direct preference optimization of video large multimodal models from language model reward, 2024d
-
[69]
Llava-next: A strong zero-shot video understanding model
Yuanhan Zhang, Bo Li, Haotian Liu, Yong Jae Lee, Liangke Gui, Di Fu, Jiashi Feng, Ziwei Liu, and Chunyuan Li. Llava-next: A strong zero-shot video understanding model, April 2024e. URL https://llava-vl.github.io/blog/2024-04-30-llava-next-video/
-
[71]
Luowei Zhou and Jason J. Corso. Youcookii dataset. 2017. URL https://api.semanticscholar.org/CorpusID:19774151
-
[72]
Bin Zhu, Bin Lin, Munan Ning, Yang Yan, Jiaxi Cui, Wang HongFa, Yatian Pang, Wenhao Jiang, Junwu Zhang, Zongwei Li, Cai Wan Zhang, Zhifeng Li, Wei Liu, and Li Yuan. Languagebind: Extending video-language pretraining to n-modality by language-based semantic alignment, 2023a
-
[74]
Visual Prompt Tuning. arXiv preprint arXiv:2203.12119
-
[75]
Parameter-efficient transfer learning for NLP. In International Conference on Machine Learning (ICML), 2019
-
[76]
LoRA: Low-Rank Adaptation of Large Language Models
LoRA: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685
-
[77]
Towards a unified view of parameter-efficient transfer learning
Towards a unified view of parameter-efficient transfer learning. arXiv preprint arXiv:2110.04366
-
[78]
Factual probing is [MASK]: Learning vs. learning to recall
Factual probing is [MASK]: Learning vs. learning to recall. arXiv preprint arXiv:2104.05240
-
[79]
Bitfit: Simple parameter-efficient fine-tuning for transformer-based masked language-models
Bitfit: Simple parameter-efficient fine-tuning for transformer-based masked language-models. arXiv preprint arXiv:2106.10199
-
[80]
Attention is all you need. Advances in Neural Information Processing Systems (NeurIPS)
-
[81]
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805
-
[82]
Language models are few-shot learners. Advances in Neural Information Processing Systems (NeurIPS)
-
[83]
Prefix-Tuning: Optimizing Continuous Prompts for Generation
Prefix-tuning: Optimizing continuous prompts for generation. arXiv preprint arXiv:2101.00190
-
[84]
The Power of Scale for Parameter-Efficient Prompt Tuning
The power of scale for parameter-efficient prompt tuning. arXiv preprint arXiv:2104.08691
-
[85]
Conditional Prompt Learning for Vision-Language Models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
-
[86]
Learning to Prompt for Vision-Language Models. International Journal of Computer Vision (IJCV)
-
[87]
Prompt Distribution Learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
-
[88]
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929
-
[89]
Unipelt: A unified framework for parameter-efficient language model tuning
Unipelt: A unified framework for parameter-efficient language model tuning. arXiv preprint arXiv:2110.07577
-
[90]
Neural Architecture Search with Reinforcement Learning
Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578
-
[91]
Learning transferable architectures for scalable image recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
-
[92]
Learning generalisable omni-scale representations for person re-identification. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI)
-
[93]
Large-scale evolution of image classifiers. In International Conference on Machine Learning (ICML), 2017
-
[94]
Efficient neural architecture search via parameters sharing. In International Conference on Machine Learning (ICML), 2018
-
[95]
DARTS: Differentiable Architecture Search
Darts: Differentiable architecture search. arXiv preprint arXiv:1806.09055
-
[96]
Autoformer: Searching transformers for visual recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)
-
[97]
Deep residual learning for image recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
-
[98]
A Large-scale Study of Representation Learning with the Visual Task Adaptation Benchmark
A large-scale study of representation learning with the visual task adaptation benchmark. arXiv preprint arXiv:1910.04867
-
[99]
Deepmind lab. arXiv preprint arXiv:1612.03801
-
[100]
dsprites: Disentanglement testing sprites dataset
-
[101]
Clevr: A diagnostic dataset for compositional language and elementary visual reasoning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
-
[102]
A Generalist Agent. arXiv preprint arXiv:2205.06175
-
[103]
Learning methods for generic object recognition with invariance to pose and lighting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2004
-
[104]
Visualizing and understanding convolutional networks. In European Conference on Computer Vision (ECCV), 2014
-
[105]
Domain generalization in vision: A survey
Domain generalization: A survey. arXiv preprint arXiv:2103.02503
-
[106]
Florence: A New Foundation Model for Computer Vision
Florence: A New Foundation Model for Computer Vision. arXiv preprint arXiv:2111.11432