LLaVA-Video: Video Instruction Tuning With Synthetic Data
Pith reviewed 2026-05-10 23:16 UTC · model grok-4.3
The pith
Synthetic video instructions train a model that performs strongly on real benchmarks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
A synthetic dataset called LLaVA-Video-178K is created to supply video-specific instruction examples in three formats: detailed captioning, open-ended question answering, and multiple-choice question answering. When a video model is trained on this dataset together with prior visual instruction data, it records strong results across multiple established video benchmarks.
What carries the argument
The synthetic dataset generation pipeline that produces the 178K video instructions for captioning and question-answering tasks.
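To make the three instruction formats concrete, here is a minimal sketch of how such records could be represented for instruction tuning. The field names and the JSONL layout are illustrative assumptions, not the released LLaVA-Video-178K schema.

```python
# A minimal sketch of the three instruction formats described in the paper
# (detailed captioning, open-ended QA, multiple-choice QA) as training records.
# Field names and structure are illustrative assumptions, not the released schema.
from dataclasses import dataclass, field, asdict
import json


@dataclass
class VideoInstruction:
    video_id: str                                      # identifier of the source clip
    task: str                                          # "caption" | "open_qa" | "mc_qa"
    question: str                                      # instruction or question shown to the model
    answer: str                                        # target response
    choices: list[str] = field(default_factory=list)   # only used for multiple-choice QA


def to_jsonl(records: list[VideoInstruction]) -> str:
    """Serialize instruction records to JSON Lines for instruction tuning."""
    return "\n".join(json.dumps(asdict(r)) for r in records)


# Usage: one hypothetical record per task type.
records = [
    VideoInstruction("vid_0001", "caption",
                     "Describe the video in detail.",
                     "A person sets a table, then pours water into two glasses."),
    VideoInstruction("vid_0001", "open_qa",
                     "What does the person do after setting the table?",
                     "They pour water into two glasses."),
    VideoInstruction("vid_0001", "mc_qa",
                     "What liquid is poured?",
                     "B",
                     choices=["A. Juice", "B. Water", "C. Milk", "D. Coffee"]),
]
print(to_jsonl(records))
```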
If this is right
- Video models become easier to scale because training no longer depends on massive new crawls of real footage.
- The same synthetic approach can be used to add more examples for specific video tasks without additional manual labeling.
- Releasing both the dataset and the generation code lets other groups reproduce and extend the training process.
Where Pith is reading between the lines
- The method could be adapted to generate training examples for models that handle longer or more complex video sequences.
- Synthetic instructions might reduce reliance on web-sourced data that often carries cultural or platform-specific biases.
- Direct comparison of the synthetic model against one trained only on real data of equal size would quantify how much quality is lost or gained.
Load-bearing premise
The generated instructions contain enough realistic detail and variety that models trained on them will generalize to unseen real-world videos rather than only succeeding on the chosen test sets.
What would settle it
Run the trained model on a fresh collection of real videos drawn from sources outside the generation process. A large drop in accuracy on the same task types would show that the synthetic data did not produce genuine generalization.
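A minimal sketch of that check, assuming per-task accuracies have already been computed for both the established benchmarks and the fresh video collection. The 10% tolerance is an arbitrary illustrative threshold, not a value from the paper.

```python
# Compare accuracy on established benchmarks against accuracy on a fresh,
# out-of-source video collection, and flag a large relative drop.
# The threshold is an assumption for illustration only.

def relative_drop(benchmark_acc: float, fresh_acc: float) -> float:
    """Fraction of benchmark accuracy lost on the fresh video collection."""
    if benchmark_acc <= 0:
        raise ValueError("benchmark accuracy must be positive")
    return (benchmark_acc - fresh_acc) / benchmark_acc


def generalization_holds(benchmark_acc: float, fresh_acc: float,
                         max_drop: float = 0.10) -> bool:
    """True if the drop stays within an (assumed) tolerance, e.g. 10%."""
    return relative_drop(benchmark_acc, fresh_acc) <= max_drop


# Usage with placeholder numbers (purely illustrative):
print(generalization_holds(benchmark_acc=0.62, fresh_acc=0.58))  # True: small drop
print(generalization_holds(benchmark_acc=0.62, fresh_acc=0.41))  # False: large drop
```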
original abstract
The development of video large multimodal models (LMMs) has been hindered by the difficulty of curating large amounts of high-quality raw data from the web. To address this, we propose an alternative approach by creating a high-quality synthetic dataset specifically for video instruction-following, namely LLaVA-Video-178K. This dataset includes key tasks such as detailed captioning, open-ended question-answering (QA), and multiple-choice QA. By training on this dataset, in combination with existing visual instruction tuning data, we introduce LLaVA-Video, a new video LMM. Our experiments demonstrate that LLaVA-Video achieves strong performance across various video benchmarks, highlighting the effectiveness of our dataset. We plan to release the dataset, its generation pipeline, and the model checkpoints.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces LLaVA-Video-178K, a synthetic dataset of 178K video instruction-following examples covering detailed captioning, open-ended QA, and multiple-choice QA tasks. By training on this dataset in combination with existing visual instruction tuning data, the authors create LLaVA-Video, a video large multimodal model, and report strong performance across various video benchmarks. The authors plan to release the dataset, its generation pipeline, and the model checkpoints.
Significance. If the synthetic data is shown to drive measurable gains, the approach could meaningfully address data scarcity challenges in video LMM development by offering a scalable alternative to web curation. The explicit commitment to releasing the full dataset, pipeline, and checkpoints is a clear strength for reproducibility and community follow-up work.
major comments (2)
- [Experimental Results] The central claim that LLaVA-Video-178K is effective rests on benchmark gains, yet the experimental section contains no ablation that trains an identical model on existing visual instruction data alone and measures the delta on the same video benchmarks. Without this control, performance cannot be attributed to the synthetic data rather than the base model or prior data.
- [§3] §3 (Data Generation): The pipeline description provides no quantitative metrics or controls for data quality, diversity, or hallucination rates in the generated instructions. This is load-bearing because the assumption that the synthetic data supports genuine generalization (rather than benchmark overfitting) is not tested.
minor comments (1)
- [Abstract] The abstract asserts 'strong performance' without any numerical results, baselines, or table references; including one or two key metrics would make the claim immediately verifiable.
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive feedback. We agree that additional controls would strengthen the attribution of gains to the synthetic dataset and provide better evidence of data quality. We address each major comment below and will revise the manuscript accordingly.
point-by-point responses
-
Referee: [Experimental Results] The central claim that LLaVA-Video-178K is effective rests on benchmark gains, yet the experimental section contains no ablation that trains an identical model on existing visual instruction data alone and measures the delta on the same video benchmarks. Without this control, performance cannot be attributed to the synthetic data rather than the base model or prior data.
Authors: We agree that a direct ablation isolating the contribution of LLaVA-Video-178K is important for attributing performance gains. In the revised manuscript, we will add results from training the identical base model and training setup on the existing visual instruction tuning data alone (without LLaVA-Video-178K) and report the performance deltas on the same video benchmarks. This will allow clearer attribution of improvements to the synthetic data. revision: yes
-
Referee: [§3] §3 (Data Generation): The pipeline description provides no quantitative metrics or controls for data quality, diversity, or hallucination rates in the generated instructions. This is load-bearing because the assumption that the synthetic data supports genuine generalization (rather than benchmark overfitting) is not tested.
Authors: We acknowledge that quantitative metrics for data quality would strengthen the paper. In the revised §3, we will include quantitative controls such as diversity statistics (e.g., vocabulary size, n-gram uniqueness across samples, and semantic embedding variance) and hallucination estimates obtained via a combination of automated consistency checks and manual sampling. We will also discuss how these metrics support the claim of genuine generalization rather than overfitting. revision: yes
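As a concrete illustration of the promised diversity statistics, the sketch below computes vocabulary size and an n-gram uniqueness ratio over generated instruction texts using only the standard library. The embedding-variance and hallucination checks mentioned in the response would require additional models and are not shown; the function names and the choice of trigram uniqueness are assumptions, not the authors' protocol.

```python
# Vocabulary size and n-gram uniqueness over a list of generated instructions:
# a minimal, standard-library-only sketch of the diversity metrics named above.
from collections import Counter


def ngrams(tokens: list[str], n: int) -> list[tuple[str, ...]]:
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def diversity_stats(texts: list[str], n: int = 3) -> dict[str, float]:
    all_tokens: list[str] = []
    gram_counts: Counter = Counter()
    for t in texts:
        tokens = t.lower().split()
        all_tokens.extend(tokens)
        gram_counts.update(ngrams(tokens, n))
    total_grams = sum(gram_counts.values())
    return {
        "vocab_size": float(len(set(all_tokens))),
        # share of n-grams that occur exactly once (higher = more diverse)
        "unique_ngram_ratio": (
            sum(1 for c in gram_counts.values() if c == 1) / total_grams
            if total_grams else 0.0
        ),
    }


# Usage on a few toy generated instructions:
samples = [
    "describe the video in detail",
    "what does the person do after setting the table",
    "describe the video in detail",   # a duplicate lowers uniqueness
]
print(diversity_stats(samples, n=3))
```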
Circularity Check
No circularity: empirical training and external benchmark evaluation
full rationale
The paper describes an empirical pipeline: synthetic data generation for video instruction tuning, combined training with existing data, and evaluation on standard external video benchmarks. No derivation chain, equations, or first-principles predictions are present that could reduce to inputs by construction. Claims of effectiveness rest on measured benchmark performance rather than self-definitional fits, renamed results, or load-bearing self-citations. This is a standard non-circular empirical ML paper.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Synthetic video instruction data generated via the described pipeline is sufficiently high-quality and representative to improve model performance on real benchmarks.
Forward citations
Cited by 56 Pith papers
-
Flame3D: Zero-shot Compositional Reasoning of 3D Scenes with Agentic Language Models
Flame3D enables zero-shot compositional 3D scene reasoning by representing scenes as editable visual-textual memories exposed to agentic MLLMs through composable and synthesizable spatial tools.
-
MedHorizon: Towards Long-context Medical Video Understanding in the Wild
MedHorizon benchmark reveals current multimodal LLMs achieve only 41.1% accuracy on long medical videos due to failures in sparse evidence retrieval and procedural reasoning.
-
When Text Hijacks Vision: Benchmarking and Mitigating Text Overlay-Induced Hallucination in Vision Language Models
VLMs hallucinate by prioritizing contradictory on-screen text over visual content, addressed via the VisualTextTrap benchmark with 6,057 human-validated samples and the VTHM-MoE dual-encoder framework using dimension-...
-
RefereeBench: Are Video MLLMs Ready to be Multi-Sport Referees
RefereeBench shows that even the strongest video MLLMs reach only around 60% accuracy on multi-sport refereeing tasks and struggle with rule application and temporal grounding.
-
TOC-Bench: A Temporal Object Consistency Benchmark for Video Large Language Models
TOC-Bench is a new diagnostic benchmark that reveals major weaknesses in temporal object consistency for Video-LLMs, including event counting, ordering, identity reasoning, and hallucination avoidance.
-
Semantic-Aware Adaptive Visual Memory for Streaming Video Understanding
SAVEMem improves streaming video understanding scores by adding semantic awareness to memory compression and query-adaptive retrieval without any model training.
-
VideoRouter: Query-Adaptive Dual Routing for Efficient Long-Video Understanding
VideoRouter uses query-adaptive semantic and image routers plus new training datasets to reduce visual tokens by up to 67.9% while improving performance over the InternVL baseline on long-video benchmarks.
-
Retrieving Any Relevant Moments: Benchmark and Models for Generalized Moment Retrieval
Generalized Moment Retrieval (GMR) is introduced as a unified task with the Soccer-GMR benchmark and adapter models that retrieve multiple or zero matching moments from videos.
-
Membership Inference Attacks Against Video Large Language Models
A temperature-perturbed black-box attack infers video training membership in VideoLLMs with 0.68 AUC by exploiting sharper generation behavior on member samples.
-
Don't Pause! Every prediction matters in a streaming video
SPOT-Bench tests real-time streaming video perception with timeliness metrics, exposing limitations in current models and introducing AsynKV as an improved baseline.
-
MuSS: A Large-Scale Dataset and Cinematic Narrative Benchmark for Multi-Shot Subject-to-Video Generation
MuSS is a new movie-sourced dataset and benchmark that enables AI models to generate multi-shot videos with improved narrative coherence and subject identity preservation.
-
CGC: Compositional Grounded Contrast for Fine-Grained Multi-Image Understanding
CGC improves fine-grained multi-image understanding in MLLMs by constructing contrastive training instances from existing single-image annotations and adding a rule-based spatial reward, achieving SOTA on MIG-Bench an...
-
Sink-Token-Aware Pruning for Fine-Grained Video Understanding in Efficient Video LLMs
Sink-Token-aware Pruning (SToP) suppresses semantically uninformative sink tokens during visual token pruning in Video LLMs, boosting fine-grained performance even at 90% pruning rates across hallucination, reasoning,...
-
OASIS: On-Demand Hierarchical Event Memory for Streaming Video Reasoning
OASIS organizes streaming video into hierarchical events and retrieves memory on-demand via intent-driven refinement to improve long-horizon accuracy and compositional reasoning with bounded token costs.
-
Beyond a Single Frame: Multi-Frame Spatially Grounded Reasoning Across Volumetric MRI
A new multi-frame VQA benchmark on volumetric MRI demonstrates that bounding-box supervised fine-tuning improves spatial grounding in VLMs over zero-shot baselines.
-
MCSC-Bench: Multimodal Context-to-Script Creation for Realistic Video Production
MCSC-Bench is the first large-scale dataset for the Multimodal Context-to-Script Creation task, requiring models to select relevant shots from redundant materials, plan missing shots, and generate coherent scripts wit...
-
OmniScript: Towards Audio-Visual Script Generation for Long-Form Cinematic Video
OmniScript is a new 8B omni-modal model that turns long cinematic videos into scene-by-scene scripts and matches top proprietary models on temporal localization and semantic accuracy.
-
SiMing-Bench: Evaluating Procedural Correctness from Continuous Interactions in Clinical Skill Videos
SiMing-Bench shows current MLLMs have weak agreement with physicians on procedural correctness in clinical videos, with intermediate step judgments remaining poor even when overall scores look acceptable.
-
AdaSpark: Adaptive Sparsity for Efficient Long-Video Understanding
AdaSpark delivers up to 57% FLOP reduction in Video-LLMs for long videos through adaptive cube- and token-level sparsity without apparent loss in performance on hour-scale benchmarks.
-
SVAgent: Storyline-Guided Long Video Understanding via Cross-Modal Multi-Agent Collaboration
SVAgent improves long video question answering by constructing storylines via multi-agent collaboration and aligning cross-modal predictions for more robust, human-like reasoning.
-
BoxComm: Benchmarking Category-Aware Commentary Generation and Narration Rhythm in Boxing
BoxComm is the first large-scale benchmark for category-aware commentary generation and rhythm assessment in boxing, showing state-of-the-art multimodal models struggle with tactical analysis and temporal pacing.
-
Unified Reward Model for Multimodal Understanding and Generation
UnifiedReward is the first unified reward model that jointly assesses multimodal understanding and generation to provide better preference signals for aligning vision models via DPO.
-
Video-MMMU: Evaluating Knowledge Acquisition from Multi-Discipline Professional Videos
Video-MMMU benchmark shows large multimodal models exhibit steep performance drops on higher cognitive tasks when learning from professional videos and lag significantly behind humans in knowledge acquisition.
-
Boosting Omni-Modal Language Models: Staged Post-Training with Visually Debiased Evaluation
Staged post-training with self-distillation lets a 3B omni-modal model match or slightly exceed a 30B model on a visually debiased benchmark.
-
OTT-Vid: Optimal Transport Temporal Token Compression for Video Large Language Models
OTT-Vid uses optimal transport with non-uniform token mass and locality-aware costs to dynamically allocate compression budgets across video frames, retaining 95.8% VQA and 73.9% VTG performance at 10% token retention.
-
SpaAct: Spatially-Activated Transition Learning with Curriculum Adaptation for Vision-Language Navigation
SpaAct activates spatial awareness in VLMs using action retrospection, future frame prediction, and progressive curriculum learning to reach SOTA on VLN-CE benchmarks.
-
DenseStep2M: A Scalable, Training-Free Pipeline for Dense Instructional Video Annotation
A scalable training-free pipeline using video segmentation, filtering, and off-the-shelf multimodal models creates DenseStep2M, a dataset of 100K videos and 2M detailed instructional steps that improves dense captioni...
-
Video-ToC: Video Tree-of-Cue Reasoning
Video-ToC adds tree-guided cue localization, demand-based RL rewards, and automated datasets to video LLMs, reporting better results than prior methods on six understanding benchmarks plus a hallucination test.
-
Geometry-Guided 3D Visual Token Pruning for Video-Language Models
Geo3DPruner uses geometry-aware global attention and two-stage voxel pruning to remove 90% of visual tokens from spatial videos while keeping over 90% of original performance on 3D scene benchmarks.
-
Unmasking the Illusion of Embodied Reasoning in Vision-Language-Action Models
State-of-the-art vision-language-action models catastrophically fail dynamic embodied reasoning due to lexical-kinematic shortcuts, behavioral inertia, and semantic feature collapse caused by architectural bottlenecks...
-
One Token per Highly Selective Frame: Towards Extreme Compression for Long Video Understanding
XComp reaches extreme video compression (one token per selective frame) via learnable progressive token compression and question-conditioned frame selection, lifting LVBench accuracy from 42.9 percent to 46.2 percent ...
-
POINTS-Long: Adaptive Dual-Mode Visual Reasoning in MLLMs
POINTS-Long is a dual-mode multimodal large language model that uses dynamic visual token scaling to retain 97.7-99.7% accuracy on long-form tasks with 1/40 to 1/10th the tokens and supports streaming via detachable KV-cache.
-
ABMAMBA: Multimodal Large Language Model with Aligned Hierarchical Bidirectional Scan for Efficient Video Captioning
ABMamba uses Mamba-based linear-complexity processing plus a novel Aligned Hierarchical Bidirectional Scan to deliver competitive video captioning on VATEX and MSR-VTT at roughly 3x higher throughput than typical Tran...
-
HAWK: Head Importance-Aware Visual Token Pruning in Multimodal Models
HAWK is a training-free method that prunes over 80% of visual tokens in MLLMs while retaining 96% accuracy by using head importance weights and text-guided attention to select task-relevant tokens.
-
Let Geometry GUIDE: Layer-wise Unrolling of Geometric Priors in Multimodal LLMs
GUIDE unrolls multi-granularity geometric priors layer-wise into early MLLM layers with gating to improve spatial reasoning and perception.
-
VideoStir: Understanding Long Videos via Spatio-Temporally Structured and Intent-Aware RAG
VideoStir introduces a spatio-temporal graph-based structure and intent-aware retrieval for long-video RAG, achieving competitive performance with SOTA methods via a new IR-600K dataset.
-
Watch Before You Answer: Learning from Visually Grounded Post-Training
Filtering post-training data to visually grounded questions improves VLM video understanding performance by up to 6.2 points using 69% of the data.
-
Video-MME-v2: Towards the Next Stage in Benchmarks for Comprehensive Video Understanding
Video-MME-v2 is a new benchmark that applies progressive visual-to-reasoning levels and non-linear group scoring to expose gaps in video MLLM capabilities.
-
Reinforce to Learn, Elect to Reason: A Dual Paradigm for Video Reasoning
RLER trains video-reasoning models with three task-driven RL rewards for evidence production and elects the best answer from a few candidates via evidence consistency scoring, yielding 6.3% average gains on eight benchmarks.
-
Graph-to-Frame RAG: Visual-Space Knowledge Fusion for Training-Free and Auditable Video Reasoning
G2F-RAG converts retrieved knowledge subgraphs into a single visual reasoning frame appended to videos, enabling training-free and interpretable improvements for LMM-based video reasoning on knowledge-intensive tasks.
-
STEAR: Layer-Aware Spatiotemporal Evidence Intervention for Hallucination Mitigation in Video Large Language Models
STEAR reduces spatial and temporal hallucinations in Video-LLMs via layer-aware evidence intervention from middle decoder layers in a single-encode pass.
-
Lifting Unlabeled Internet-level Data for 3D Scene Understanding
Unlabeled web videos processed by designed data engines generate effective training data that yields strong zero-shot and finetuned performance on 3D detection, segmentation, VQA, and navigation.
-
SmolVLM: Redefining small and efficient multimodal models
SmolVLM-256M outperforms a 300-times larger model using under 1 GB GPU memory, while the 2.2B version matches state-of-the-art VLMs at half the memory cost.
-
VBench-2.0: Advancing Video Generation Benchmark Suite for Intrinsic Faithfulness
VBench-2.0 is a benchmark suite that automatically evaluates video generative models on five dimensions of intrinsic faithfulness: Human Fidelity, Controllability, Creativity, Physics, and Commonsense using VLMs, LLMs...
-
What Limits Vision-and-Language Navigation ?
StereoNav reaches new benchmark highs on R2R-CE and RxR-CE and improves real-robot reliability by supplying persistent target-location priors and stereo-derived geometry that stay stable under lighting changes and blur.
-
OmniRefine: Alignment-Aware Cooperative Compression for Efficient Omnimodal Large Language Models
OmniRefine introduces alignment-aware chunk refinement via similarity and dynamic programming followed by modality-cooperative token compression, achieving near-baseline accuracy at 44% token retention on WorldSense.
-
From Where Things Are to What They Are For: Benchmarking Spatial-Functional Intelligence in Multimodal LLMs
SFI-Bench shows current multimodal LLMs struggle to integrate spatial memory with functional reasoning and external knowledge in video tasks.
-
Dual-Anchoring: Addressing State Drift in Vision-Language Navigation
Dual-Anchoring adds explicit progress tokens and retrospective landmark verification to VLN agents, cutting state drift and lifting success rate 15.2% overall with 24.7% gains on long trajectories.
-
Think before Go: Hierarchical Reasoning for Image-goal Navigation
HRNav decomposes image-goal navigation into VLM-based short-horizon planning and RL-based execution with a wandering suppression penalty to improve performance in complex unseen settings.
-
OmniJigsaw: Enhancing Omni-Modal Reasoning via Modality-Orchestrated Reordering
OmniJigsaw is a self-supervised proxy task that reconstructs shuffled audio-visual clips via joint integration, sample-level selection, and clip-level masking strategies, yielding gains on 15 video, audio, and reasoni...
-
EasyVideoR1: Easier RL for Video Understanding
EasyVideoR1 delivers an optimized RL pipeline for video understanding in large vision-language models, achieving 1.47x throughput gains and aligned results on 22 benchmarks.
-
Show-o2: Improved Native Unified Multimodal Models
Show-o2 unifies text, image, and video understanding and generation in a single autoregressive-plus-flow-matching model built on 3D causal VAE representations.
-
VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding
VideoLLaMA3 uses a vision-centric training paradigm and token-reduction design to reach competitive results on image and video benchmarks.
Reference graph
Works this paper leans on
-
[1]
Flamingo: a Visual Language Model for Few-Shot Learning
Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katie Millican, Malcolm Reynolds, Roman Ring, Eliza Rutherford, Serkan Cabi, Tengda Han, Zhitao Gong, Sina Samangooei, Marianne Monteiro, Jacob Menick, Sebastian Borgeaud, Andrew Brock, Aida Nematzadeh, Sahand Sharifzadeh, Mikolaj Binkowski,...
work page internal anchor Pith review arXiv 2022
-
[2]
Localizing moments in video with natural language
Lisa Anne Hendricks, Oliver Wang, Eli Shechtman, Josef Sivic, Trevor Darrell, and Bryan Russell. Localizing moments in video with natural language. In Proceedings of the IEEE international conference on computer vision, pp.\ 5803--5812, 2017 a
work page 2017
-
[3]
Localizing moments in video with natural language
Lisa Anne Hendricks, Oliver Wang, Eli Shechtman, Josef Sivic, Trevor Darrell, and Bryan Russell. Localizing moments in video with natural language. In Proceedings of the IEEE international conference on computer vision, pp.\ 5803--5812, 2017 b
work page 2017
-
[4]
Frozen in time: A joint video and image encoder for end-to-end retrieval
Max Bain, Arsha Nagrani, Gül Varol, and Andrew Zisserman. Frozen in time: A joint video and image encoder for end-to-end retrieval. In IEEE International Conference on Computer Vision, 2021
work page 2021
-
[5]
Activitynet: A large-scale video benchmark for human activity understanding
Fabian Caba Heilbron, Victor Escorcia, Bernard Ghanem, and Juan Carlos Niebles. Activitynet: A large-scale video benchmark for human activity understanding. In Proceedings of the ieee conference on computer vision and pattern recognition, pp.\ 961--970, 2015
work page 2015
-
[7]
Collecting highly parallel data for paraphrase evaluation
David Chen and William B Dolan. Collecting highly parallel data for paraphrase evaluation. In Proceedings of the 49th annual meeting of the association for computational linguistics: human language technologies, pp.\ 190--200, 2011
work page 2011
-
[10]
VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs
Zesen Cheng, Sicong Leng, Hang Zhang, Yifei Xin, Xin Li, Guanzheng Chen, Yongxin Zhu, Wenqi Zhang, Ziyang Luo, Deli Zhao, and Lidong Bing. Videollama 2: Advancing spatial-temporal modeling and audio understanding in video-llms, 2024. URL https://arxiv.org/abs/2406.07476
work page internal anchor Pith review arXiv 2024
-
[11]
Slowfast networks for video recognition
Christoph Feichtenhofer, Haoqi Fan, Jitendra Malik, and Kaiming He. Slowfast networks for video recognition. In Proceedings of the IEEE/CVF international conference on computer vision, pp.\ 6202--6211, 2019
work page 2019
-
[13]
Raghav Goyal, Samira Ebrahimi Kahou, Vincent Michalski, Joanna Materzynska, Susanne Westphal, Heuna Kim, Valentin Haenel, Ingo Fruend, Peter Yianilos, Moritz Mueller-Freitag, et al. The "something something" video database for learning and evaluating visual common sense. In Proceedings of the IEEE international conference on computer vision, pp. 5842--5850, 2017
work page 2017
-
[14]
Ego4d: Around the world in 3,000 hours of egocentric video
Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, et al. Ego4d: Around the world in 3,000 hours of egocentric video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.\ 18995--19012, 2022
work page 2022
-
[15]
Agqa: A benchmark for compositional spatio-temporal reasoning
Madeleine Grunde-McLaughlin, Ranjay Krishna, and Maneesh Agrawala. Agqa: A benchmark for compositional spatio-temporal reasoning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.\ 11287--11297, 2021
work page 2021
-
[21]
Acav100m: Automatic curation of large-scale datasets for audio-visual video representation learning
Sangho Lee, Jiwan Chung, Youngjae Yu, Gunhee Kim, Thomas Breuel, Gal Chechik, and Yale Song. Acav100m: Automatic curation of large-scale datasets for audio-visual video representation learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp.\ 10274--10284, 2021
work page 2021
-
[23]
Less is more: Clipbert for video-and-language learning via sparse sampling
Jie Lei, Linjie Li, Luowei Zhou, Zhe Gan, Tamara L Berg, Mohit Bansal, and Jingjing Liu. Less is more: Clipbert for video-and-language learning via sparse sampling. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp.\ 7331--7341, 2021
work page 2021
-
[25]
Llava-next: What else influences visual instruction tuning beyond data?, May 2024 a
Bo Li, Hao Zhang, Kaichen Zhang, Dong Guo, Yuanhan Zhang, Renrui Zhang, Feng Li, Ziwei Liu, and Chunyuan Li. Llava-next: What else influences visual instruction tuning beyond data?, May 2024 a . URL https://llava-vl.github.io/blog/2024-05-25-llava-next-ablations/
work page 2024
-
[26]
Llava-next: Stronger llms supercharge multimodal capabilities in the wild, May 2024 b
Bo Li, Kaichen Zhang, Hao Zhang, Dong Guo, Renrui Zhang, Feng Li, Yuanhan Zhang, Ziwei Liu, and Chunyuan Li. Llava-next: Stronger llms supercharge multimodal capabilities in the wild, May 2024 b . URL https://llava-vl.github.io/blog/2024-05-10-llava-next-stronger-llms/
work page 2024
-
[28]
Multimodal foundation models: From specialists to general-purpose assistants
Chunyuan Li, Zhe Gan, Zhengyuan Yang, Jianwei Yang, Linjie Li, Lijuan Wang, Jianfeng Gao, et al. Multimodal foundation models: From specialists to general-purpose assistants. Foundations and Trends in Computer Graphics and Vision , 2024 d
work page 2024
-
[29]
Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models, 2023. URL https://arxiv.org/abs/2301.12597
work page internal anchor Pith review arXiv 2023
-
[30]
VideoChat: Chat-Centric Video Understanding
KunChang Li, Yinan He, Yi Wang, Yizhuo Li, Wenhai Wang, Ping Luo, Yali Wang, Limin Wang, and Yu Qiao. Videochat: Chat-centric video understanding, 2024 e . URL https://arxiv.org/abs/2305.06355
work page internal anchor Pith review arXiv 2024
-
[31]
Vila: On pre-training for visual language models
Ji Lin, Hongxu Yin, Wei Ping, Pavlo Molchanov, Mohammad Shoeybi, and Song Han. Vila: On pre-training for visual language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.\ 26689--26699, 2024
work page 2024
-
[32]
Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in neural information processing systems, 36, 2024 a
work page 2024
-
[34]
LMMs-Lab. Video detail caption, 2024. URL https://huggingface.co/datasets/lmms-lab/VideoDetailCaption
work page 2024
-
[36]
Video-chatgpt: Towards detailed video understanding via large vision and language models
Muhammad Maaz, Hanoona Rasheed, Salman Khan, and Fahad Shahbaz Khan. Video-chatgpt: Towards detailed video understanding via large vision and language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL 2024), 2024
work page 2024
-
[37]
Egoschema: A diagnostic benchmark for very long-form video language understanding
Karttikeya Mangalam, Raiymbek Akshulakov, and Jitendra Malik. Egoschema: A diagnostic benchmark for very long-form video language understanding. Advances in Neural Information Processing Systems, 36, 2024
work page 2024
-
[38]
Antoine Miech, Dimitri Zhukov, Jean-Baptiste Alayrac, Makarand Tapaswi, Ivan Laptev, and Josef Sivic. HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips. In ICCV, 2019
work page 2019
-
[39]
OpenAI. Gpt-4v. https://openai.com/index/gpt-4v-system-card/, 2023
work page 2023
-
[41]
Perception test: A diagnostic benchmark for multimodal video models
Viorica Pătrăucean, Lucas Smaira, Ankush Gupta, Adrià Recasens Continente, Larisa Markeeva, Dylan Banarse, Skanda Koppula, Joseph Heyward, Mateusz Malinowski, Yi Yang, Carl Doersch, Tatiana Matejovicova, Yury Sulsky, Antoine Miech, Alex Frechette, Hanna Klimczak, Raphael Koster, Junlin Zhang, Stephanie Winkler, Yusuf Aytar, Simon Osindero, Dima Damen, And...
work page 2023
-
[42]
Learning transferable visual models from natural language supervision
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning (ICML), pp.\ 8748--8763. PMLR, 2021
work page 2021
-
[43]
Nils Reimers and Iryna Gurevych. Making monolingual sentence embeddings multilingual using knowledge distillation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 11 2020. URL https://arxiv.org/abs/2004.09813
-
[44]
A dataset for movie description
Anna Rohrbach, Marcus Rohrbach, Niket Tandon, and Bernt Schiele. A dataset for movie description. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp.\ 3202--3212, 2015
work page 2015
-
[45]
Annotating objects and relations in user-generated videos
Xindi Shang, Donglin Di, Junbin Xiao, Yu Cao, Xun Yang, and Tat-Seng Chua. Annotating objects and relations in user-generated videos. In Proceedings of the 2019 on International Conference on Multimedia Retrieval, pp.\ 279--287. ACM, 2019
work page 2019
-
[46]
Hollywood in homes: Crowdsourcing data collection for activity understanding
Gunnar A Sigurdsson, Gül Varol, Xiaolong Wang, Ali Farhadi, Ivan Laptev, and Abhinav Gupta. Hollywood in homes: Crowdsourcing data collection for activity understanding. In Computer Vision--ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11--14, 2016, Proceedings, Part I 14, pp. 510--526. Springer, 2016
work page 2016
-
[48]
Tarsier: Recipes for training and evaluating large video description models,
Jiawei Wang, Liping Yuan, and Yuchen Zhang. Tarsier: Recipes for training and evaluating large video description models, 2024. URL https://arxiv.org/abs/2407.00634
-
[49]
Internvid: A large-scale video-text dataset for multimodal understanding and generation
Yi Wang, Yinan He, Yizhuo Li, Kunchang Li, Jiashuo Yu, Xin Ma, Xinhao Li, Guo Chen, Xinyuan Chen, Yaohui Wang, et al. Internvid: A large-scale video-text dataset for multimodal understanding and generation. In The Twelfth International Conference on Learning Representations, 2023
work page 2023
-
[51]
Longvideobench: A benchmark for long-context interleaved video-language understanding
Haoning Wu, Dongxu Li, Bei Chen, and Junnan Li. Longvideobench: A benchmark for long-context interleaved video-language understanding, 2024 b . URL https://arxiv.org/abs/2407.15754
-
[52]
Next-qa: Next phase of question-answering to explaining temporal actions
Junbin Xiao, Xindi Shang, Angela Yao, and Tat-Seng Chua. Next-qa: Next phase of question-answering to explaining temporal actions. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp.\ 9777--9786, 2021
work page 2021
-
[53]
Video question answering via gradually refined attention over appearance and motion
Dejing Xu, Zhou Zhao, Jun Xiao, Fei Wu, Hanwang Zhang, Xiangnan He, and Yueting Zhuang. Video question answering via gradually refined attention over appearance and motion. In ACM Multimedia, 2017
work page 2017
-
[54]
Msr-vtt: A large video description dataset for bridging video and language
Jun Xu, Tao Mei, Ting Yao, and Yong Rui. Msr-vtt: A large video description dataset for bridging video and language. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp.\ 5288--5296, 2016
work page 2016
-
[58]
Advancing high-resolution video-language representation with large-scale video transcriptions
Hongwei Xue, Tiankai Hang, Yanhong Zeng, Yuchong Sun, Bei Liu, Huan Yang, Jianlong Fu, and Baining Guo. Advancing high-resolution video-language representation with large-scale video transcriptions. In International Conference on Computer Vision and Pattern Recognition (CVPR), 2022
work page 2022
-
[60]
Activitynet-qa: A dataset for understanding complex web videos via question answering
Zhou Yu, Dejing Xu, Jun Yu, Ting Yu, Zhou Zhao, Yueting Zhuang, and Dacheng Tao. Activitynet-qa: A dataset for understanding complex web videos via question answering. In AAAI, pp.\ 9127--9134, 2019
work page 2019
-
[61]
Social-iq: A question answering benchmark for artificial social intelligence
Amir Zadeh, Michael Chan, Paul Pu Liang, Edmund Tong, and Louis-Philippe Morency. Social-iq: A question answering benchmark for artificial social intelligence. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.\ 8807--8817, 2019
work page 2019
-
[62]
Merlot: Multimodal neural script knowledge models
Rowan Zellers, Ximing Lu, Jack Hessel, Youngjae Yu, Jae Sung Park, Jize Cao, Ali Farhadi, and Yejin Choi. Merlot: Multimodal neural script knowledge models. Advances in neural information processing systems, 34: 23634--23651, 2021
work page 2021
-
[63]
Sigmoid loss for language image pre-training
Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp.\ 11975--11986, 2023
work page 2023
-
[68]
Direct preference optimization of video large multimodal models from language model reward, 2024 d
Ruohong Zhang, Liangke Gui, Zhiqing Sun, Yihao Feng, Keyang Xu, Yuanhan Zhang, Di Fu, Chunyuan Li, Alexander Hauptmann, Yonatan Bisk, and Yiming Yang. Direct preference optimization of video large multimodal models from language model reward, 2024 d
work page 2024
-
[69]
Llava-next: A strong zero-shot video understanding model, April 2024 e
Yuanhan Zhang, Bo Li, Haotian Liu, Yong Jae Lee, Liangke Gui, Di Fu, Jiashi Feng, Ziwei Liu, and Chunyuan Li. Llava-next: A strong zero-shot video understanding model, April 2024 e . URL https://llava-vl.github.io/blog/2024-04-30-llava-next-video/
work page 2024
-
[71]
Luowei Zhou and Jason J. Corso. Youcookii dataset. 2017. URL https://api.semanticscholar.org/CorpusID:19774151
work page 2017
-
[72]
Bin Zhu, Bin Lin, Munan Ning, Yang Yan, Jiaxi Cui, Wang HongFa, Yatian Pang, Wenhao Jiang, Junwu Zhang, Zongwei Li, Cai Wan Zhang, Zhifeng Li, Wei Liu, and Li Yuan. Languagebind: Extending video-language pretraining to n-modality by language-based semantic alignment, 2023 a
work page 2023