Recognition: 2 theorem links · Lean Theorem
VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs
Pith reviewed 2026-05-11 02:40 UTC · model grok-4.3
The pith
VideoLLaMA 2 adds a spatial-temporal convolution connector and an audio branch to advance video and audio understanding in large language models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
VideoLLaMA 2 incorporates a tailor-made Spatial-Temporal Convolution (STC) connector that captures the intricate spatial and temporal dynamics of video data, and integrates an Audio Branch into the model through joint training, enriching its multimodal understanding by seamlessly incorporating audio cues. The claim is supported by competitive results on the MC-VQA, OE-VQA, VC, AQA, and OE-AVQA benchmarks.
What carries the argument
The Spatial-Temporal Convolution (STC) connector, which processes video features to model spatial and temporal relations, together with an Audio Branch added via joint training.
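To fix ideas, here is a minimal sketch of what a convolution-based connector of this kind could look like: vision-encoder patch features are downsampled jointly in time and space by a 3D convolution, then projected into the LLM's embedding space. This is not the paper's released implementation; the 3×3×3 kernel follows the description in the simulated rebuttal below, and the module layout and all dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class STCConnectorSketch(nn.Module):
    """Hedged sketch of a spatial-temporal convolution (STC) connector.

    Assumptions (not the paper's released code): a single 3x3x3 Conv3d
    with stride 2 downsamples frame-patch features in time and space,
    and a linear layer maps the result to the LLM hidden size.
    """

    def __init__(self, vis_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.conv = nn.Conv3d(vis_dim, vis_dim, kernel_size=3, stride=2, padding=1)
        self.proj = nn.Linear(vis_dim, llm_dim)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, frames, height, width, vis_dim) patch features
        x = feats.permute(0, 4, 1, 2, 3)      # -> (B, C, T, H, W)
        x = self.conv(x)                      # joint spatio-temporal downsample
        x = x.flatten(2).transpose(1, 2)      # -> (B, tokens, C)
        return self.proj(x)                   # -> (B, tokens, llm_dim)

tokens = STCConnectorSketch()(torch.randn(1, 8, 24, 24, 1024))
print(tokens.shape)  # torch.Size([1, 576, 4096]) under these assumptions
```

With these illustrative shapes, eight frames of 24×24 patches (4,608 vision tokens) become 576 LLM tokens, an 8× reduction, while the convolution mixes information across neighboring frames rather than treating each frame independently.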
If this is right
- The STC connector enables more accurate capture of video dynamics for tasks such as question answering and captioning.
- Joint audio-visual training produces measurable gains on both audio-only and combined audio-video question-answering benchmarks.
- Open-source Video-LLMs can reach performance levels close to some proprietary models on several standard video tasks through these architectural choices.
- Multimodal comprehension improves when audio cues are integrated directly rather than handled separately after visual processing.
Where Pith is reading between the lines
- Similar convolution-style connectors might be adapted to improve temporal modeling in other video analysis settings such as action recognition or event detection.
- Early fusion of audio during training could reduce reliance on ever-larger visual-only backbones for equivalent multimodal performance.
- The approach suggests a path for extending video LLMs to longer or more complex video sequences by refining the connector rather than increasing overall parameter count.
Load-bearing premise
The reported performance gains come from the STC connector and audio branch rather than from differences in training data, model scale, or other implementation details not described.
What would settle it
Train an otherwise identical model using the same data and base components but without the STC connector or audio branch, then measure whether the benchmark scores on video and audio question-answering tasks drop by a noticeable margin.
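A hedged sketch of that settling experiment, with `train_variant` and `evaluate` as hypothetical stand-ins for the authors' training and evaluation code:

```python
# 2x2 component ablation: same data, base LLM, and optimization for
# every variant; only the two proposed components are toggled.

def train_variant(stc_connector: bool, audio_branch: bool):
    """Hypothetical stand-in for the authors' training pipeline."""
    raise NotImplementedError("plug in the real training code here")

def evaluate(model, benchmark: str) -> float:
    """Hypothetical stand-in for benchmark evaluation."""
    raise NotImplementedError("plug in the real evaluation code here")

VARIANTS = {
    "full":     dict(stc_connector=True,  audio_branch=True),
    "no_stc":   dict(stc_connector=False, audio_branch=True),
    "no_audio": dict(stc_connector=True,  audio_branch=False),
    "neither":  dict(stc_connector=False, audio_branch=False),
}

BENCHMARKS = ("MC-VQA", "OE-VQA", "VC", "AQA", "OE-AVQA")

def run_ablation() -> dict:
    """Return {variant: {benchmark: score}} for the four variants."""
    return {
        name: {b: evaluate(train_variant(**cfg), b) for b in BENCHMARKS}
        for name, cfg in VARIANTS.items()
    }
```

If the "full" variant does not beat "neither" by a clear margin on the relevant benchmarks, the load-bearing premise above fails.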
Original abstract
In this paper, we present the VideoLLaMA 2, a set of Video Large Language Models (Video-LLMs) designed to enhance spatial-temporal modeling and audio understanding in video and audio-oriented tasks. Building upon its predecessor, VideoLLaMA 2 incorporates a tailor-made Spatial-Temporal Convolution (STC) connector, which effectively captures the intricate spatial and temporal dynamics of video data. Additionally, we integrate an Audio Branch into the model through joint training, thereby enriching the multimodal understanding capabilities of the model by seamlessly incorporating audio cues. Comprehensive evaluations on multiple-choice video question answering (MC-VQA), open-ended video question answering (OE-VQA), and video captioning (VC) tasks demonstrate that VideoLLaMA 2 consistently achieves competitive results among open-source models and even gets close to some proprietary models on several benchmarks. Furthermore, VideoLLaMA 2 exhibits reasonable improvements in audio-only and audio-video question-answering (AQA & OE-AVQA) benchmarks over existing models. These advancements underline VideoLLaMA 2's superior performance in multimodal comprehension, setting a new standard for intelligent video analysis systems. All models are public to facilitate further research.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces VideoLLaMA 2, an extension of the authors' prior VideoLLaMA model that adds a Spatial-Temporal Convolution (STC) connector to better capture spatial and temporal video dynamics and an Audio Branch trained jointly to incorporate audio cues. It reports competitive results among open-source models (and proximity to some proprietary ones) on MC-VQA, OE-VQA, VC, AQA, and OE-AVQA benchmarks, attributing these outcomes to the new components.
Significance. If the performance gains can be shown to stem specifically from the STC connector and audio branch rather than from differences in training data, model scale, or other implementation details, the work would provide a useful incremental advance in multimodal video-language modeling by addressing spatio-temporal modeling and audio integration. The public release of models supports reproducibility and further research.
Major comments (3)
- §4 (Experiments): The reported benchmark results compare VideoLLaMA 2 against other open-source and proprietary models without controlled ablations that hold training data volume/quality, base LLM scale, and optimization fixed while isolating the STC connector and Audio Branch. This makes it impossible to attribute the claimed 'reasonable improvements' specifically to the proposed additions rather than to confounding factors.
- §3 (Method): The description of the STC connector (e.g., kernel sizes, stride, how it interfaces with the vision encoder and LLM) and the Audio Branch (e.g., fusion mechanism, joint training objective) is high-level; without equations or architectural diagrams that allow precise reproduction, the novelty and load-bearing role of these components cannot be assessed.
- Tables 1–3 (benchmark results): No statistical significance tests, error bars, or multiple-run averages are reported, and baseline details (data splits, exact training recipes) are omitted. This weakens the central claim that VideoLLaMA 2 'consistently achieves competitive results' and 'sets a new standard'.
Minor comments (2)
- Abstract: The phrasing 'setting a new standard for intelligent video analysis systems' overstates the results, which are described only as 'competitive' and 'close to some proprietary models'.
- Throughout: Define all acronyms (MC-VQA, OE-VQA, VC, AQA, OE-AVQA) on first use and ensure consistent capitalization of 'VideoLLaMA 2'.
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive feedback. We address each major comment below and have revised the manuscript accordingly to improve reproducibility, attribution of results, and statistical rigor where feasible.
Point-by-point responses
- Referee [§4 (Experiments)]: The reported benchmark results compare VideoLLaMA 2 against other open-source and proprietary models without controlled ablations that hold training data volume/quality, base LLM scale, and optimization fixed while isolating the STC connector and Audio Branch. This makes it impossible to attribute the claimed 'reasonable improvements' specifically to the proposed additions rather than to confounding factors.
  Authors: We agree that isolating the contributions of the STC connector and Audio Branch via fully controlled ablations (fixed data, base LLM, and optimization) would strengthen causal attribution. In the original manuscript we compared against models of comparable scale and training regimes, but we have now added a dedicated ablation study (new Table 4 and accompanying text) that trains variants with and without the STC connector and with and without the Audio Branch on the same data and base model. These results show consistent gains attributable to each component. Revision: yes.
- Referee [§3 (Method)]: The description of the STC connector (e.g., kernel sizes, stride, how it interfaces with the vision encoder and LLM) and the Audio Branch (e.g., fusion mechanism, joint training objective) is high-level; without equations or architectural diagrams that allow precise reproduction, the novelty and load-bearing role of these components cannot be assessed.
  Authors: We acknowledge the description was high-level. In the revised manuscript we have expanded §3 with explicit equations for the STC connector (including 3D convolution kernel sizes of 3×3×3, strides, padding, and the exact reshaping that maps vision-encoder features to LLM token space) and for the Audio Branch (cross-attention fusion and the joint training loss combining video and audio objectives; a hedged sketch of such a joint objective appears after these responses). We have also added a detailed architectural diagram (new Figure 2) showing all interfaces. Revision: yes.
- Referee [Tables 1–3 (benchmark results)]: No statistical significance tests, error bars, or multiple-run averages are reported, and baseline details (data splits, exact training recipes) are omitted. This weakens the central claim that VideoLLaMA 2 'consistently achieves competitive results' and 'sets a new standard'.
  Authors: We recognize that reporting variance and significance would increase confidence. Given the high computational cost of re-running all baselines multiple times, we have added (i) precise data-split and training-recipe details to the appendix, (ii) error bars from three random seeds for our own model in the main tables, and (iii) a note on the single-run nature of most competing open-source results. Full multi-seed re-evaluation of every baseline remains resource-prohibitive, but we believe the added details address the core concern. Revision: partial.
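The joint objective mentioned in the second response is not given explicitly anywhere in this review; a minimal form it could take, offered purely as an assumption, is a weighted sum of per-modality next-token losses:

```python
import torch.nn.functional as F

def joint_loss(video_logits, video_targets, audio_logits, audio_targets,
               audio_weight: float = 1.0):
    """Hedged sketch of a joint video+audio training objective.

    Assumes (not quoted from the paper) that both branches are trained
    with next-token cross-entropy and combined with a mixing weight.
    Logits are (tokens, vocab); targets are (tokens,) class indices.
    """
    l_video = F.cross_entropy(video_logits, video_targets)
    l_audio = F.cross_entropy(audio_logits, audio_targets)
    return l_video + audio_weight * l_audio
```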
Circularity Check
Minor self-citation to predecessor without load-bearing circularity in empirical claims
Full rationale
The paper builds on the authors' prior VideoLLaMA work by adding an STC connector and audio branch, then reports competitive results on MC-VQA, OE-VQA, VC, AQA, and OE-AVQA benchmarks. These performance claims rest on new evaluations rather than on any derivation that reduces by construction to previously fitted quantities or self-cited premises. The self-citation is acknowledged but does not justify the central results; the new components are presented as architectural extensions whose value is measured externally. No equations, uniqueness theorems, or predictions collapse to their inputs, satisfying the criteria for a low (non-circular) score.
Axiom & Free-Parameter Ledger
Invented entities (2)
- Spatial-Temporal Convolution (STC) connector: no independent evidence
- Audio Branch: no independent evidence
Forward citations
Cited by 58 Pith papers
- TraceAV-Bench: Benchmarking Multi-Hop Trajectory Reasoning over Long Audio-Visual Videos
  TraceAV-Bench is the first benchmark for multi-hop trajectory reasoning over long audio-visual videos, showing top models reach only 51-68% accuracy with substantial room for improvement.
- When Text Hijacks Vision: Benchmarking and Mitigating Text Overlay-Induced Hallucination in Vision Language Models
  VLMs hallucinate by prioritizing contradictory on-screen text over visual content, addressed via the VisualTextTrap benchmark with 6,057 human-validated samples and the VTHM-MoE dual-encoder framework using dimension-...
- ReTool-Video: Recursive Tool-Using Video Agents with Meta-Augmented Tool Grounding
  ReTool-Video uses a 134-tool meta-augmented library and recursive grounding to translate abstract video intents into fine-grained multimodal operations, outperforming baselines on MVBench, MLVU, and Video-MME.
- AdaFocus: Adaptive Relevance-Diversity Sampling with Zero-Cache Look-back for Efficient Long Video Understanding
  AdaFocus achieves better accuracy on long-video benchmarks with roughly 33 times fewer visual tokens by combining query-aware adaptive sampling and zero-cache disk-based refinement.
- TB-AVA: Text as a Semantic Bridge for Audio-Visual Parameter Efficient Finetuning
  TB-AVA uses text as a semantic anchor with a new Text-Bridged Audio-Visual Adapter and Gated Semantic Modulation to achieve state-of-the-art results on audio-visual benchmarks through parameter-efficient fine-tuning.
- MMVIAD: Multi-view Multi-task Video Understanding for Industrial Anomaly Detection
  MMVIAD is the first multi-view continuous video dataset for industrial anomaly detection with four supported tasks, and the VISTA model improves average benchmark scores from 45.0 to 57.5 on unseen data while surpassi...
- TOC-Bench: A Temporal Object Consistency Benchmark for Video Large Language Models
  TOC-Bench is a new diagnostic benchmark that reveals major weaknesses in temporal object consistency for Video-LLMs, including event counting, ordering, identity reasoning, and hallucination avoidance.
- TOC-Bench: A Temporal Object Consistency Benchmark for Video Large Language Models
  TOC-Bench is an object-track-grounded benchmark that filters for temporally dependent questions and shows Video-LLMs have major weaknesses in event counting, ordering, identity reasoning, and hallucination detection.
- Tracing the Arrow of Time: Diagnosing Temporal Information Flow in Video-LLMs
  Temporal information in Video-LLMs is encoded well by video-centric encoders but disrupted by standard projectors; time-preserved MLPs plus AoT supervision yield 98.1% accuracy on arrow-of-time and gains on other temp...
- VideoRouter: Query-Adaptive Dual Routing for Efficient Long-Video Understanding
  VideoRouter uses query-adaptive semantic and image routers plus new training datasets to reduce visual tokens by up to 67.9% while improving performance over the InternVL baseline on long-video benchmarks.
- Membership Inference Attacks Against Video Large Language Models
  A temperature-perturbed black-box attack infers video training membership in VideoLLMs with 0.68 AUC by exploiting sharper generation behavior on member samples.
- GaLa: Hypergraph-Guided Visual Language Models for Procedural Planning
  GaLa uses hypergraph representations of objects and a TriView encoder with contrastive learning to improve vision-language models on procedural planning benchmarks.
- Watching Movies Like a Human: Egocentric Emotion Understanding for Embodied Companions
  Creates the first egocentric screen-view movie emotion benchmark and demonstrates that cinematic models drop sharply in Macro-F1 on realistic robot-like viewing conditions while domain-specific training improves robustness.
- Chain of Modality: From Static Fusion to Dynamic Orchestration in Omni-MLLMs
  Chain of Modality dynamically orchestrates multimodal input topologies and bifurcates cognitive execution to overcome static fusion biases in Omni-MLLMs.
- Don't Let the Video Speak: Audio-Contrastive Preference Optimization for Audio-Visual Language Models
  Audio-Contrastive Preference Optimization (ACPO) mitigates audio hallucination in AVLMs via output-contrastive and input-contrastive objectives that enforce faithful audio grounding.
- BoxTuning: Directly Injecting the Object Box for Multimodal Model Fine-Tuning
  By drawing object boxes and motion trails visually on video frames instead of serializing coordinates as text, BoxTuning reduces token costs dramatically and improves accuracy on video question answering benchmarks.
- AdaSpark: Adaptive Sparsity for Efficient Long-Video Understanding
  AdaSpark delivers up to 57% FLOP reduction in Video-LLMs for long videos through adaptive cube- and token-level sparsity without apparent loss in performance on hour-scale benchmarks.
- Bridging Time and Space: Decoupled Spatio-Temporal Alignment for Video Grounding
  Bridge-STG decouples spatio-temporal alignment via semantic bridging and query-guided localization modules to achieve state-of-the-art m_vIoU of 34.3 on VidSTG among MLLM methods.
- SVAgent: Storyline-Guided Long Video Understanding via Cross-Modal Multi-Agent Collaboration
  SVAgent improves long video question answering by constructing storylines via multi-agent collaboration and aligning cross-modal predictions for more robust, human-like reasoning.
- Seeing the Scene Matters: Revealing Forgetting in Video Understanding Models with a Scene-Aware Long-Video Benchmark
  SceneBench shows VLMs lose accuracy on scene-level questions in long videos due to forgetting, and Scene-RAG retrieval improves performance by 2.5%.
- Video-R1: Reinforcing Video Reasoning in MLLMs
  Video-R1 uses temporal-aware RL and mixed datasets to boost video reasoning in MLLMs, with a 7B model reaching 37.1% on VSI-Bench and surpassing GPT-4o.
- MLVU: Benchmarking Multi-task Long Video Understanding
  MLVU is a new benchmark for long video understanding that uses extended videos across diverse genres and multi-task evaluations, revealing that current MLLMs struggle significantly and degrade sharply with longer durations.
- VideoSEAL: Mitigating Evidence Misalignment in Agentic Long Video Understanding by Decoupling Answer Authority
  Decoupling planning from answer authority in long-video agents reduces evidence misalignment and raises accuracy to 55.1% on LVBench and 62.0% on LongVideoBench.
- Keep What Audio Cannot Say: Context-Preserving Token Pruning for Omni-LLMs
  ContextGuard prunes 55% of tokens in Qwen2.5-Omni 7B while matching full performance on five of six audio-visual benchmarks by preserving audio-irrecoverable visual context.
- TB-AVA: Text as a Semantic Bridge for Audio-Visual Parameter Efficient Finetuning
  TB-AVA uses text-mediated gated semantic modulation to enable efficient audio-visual alignment, achieving state-of-the-art results on AVE, AVS, and AVVP benchmarks.
- Probing Cross-modal Information Hubs in Audio-Visual LLMs
  AVLLMs encode integrated audio-visual information primarily in specialized cross-modal sink tokens, which enables a training-free hallucination mitigation approach.
- Probing Cross-modal Information Hubs in Audio-Visual LLMs
  AVLLMs store integrated audio-visual information mainly in a distinct subset of sink tokens called cross-modal sink tokens, which can be leveraged for training-free hallucination mitigation.
- Separate First, Fuse Later: Mitigating Cross-Modal Interference in Audio-Visual LLMs Reasoning with Modality-Specific Chain-of-Thought
  Separate modality-specific reasoning before fusion reduces hallucinations and improves accuracy in audio-visual LLMs by enforcing isolated traces then integrating evidence.
- Response-G1: Explicit Scene Graph Modeling for Proactive Streaming Video Understanding
  Response-G1 uses query-guided scene graphs, memory retrieval, and augmented prompting to improve when Video-LLMs decide to respond during streaming videos.
- WeatherSyn: An Instruction Tuning MLLM For Weather Forecasting Report Generation
  WeatherSyn is the first instruction-tuned MLLM for weather forecasting report generation, outperforming closed-source models on a new dataset of 31 US cities across 8 weather aspects.
- VideoRouter: Query-Adaptive Dual Routing for Efficient Long-Video Understanding
  VideoRouter uses dual semantic and image routers for query-adaptive token compression in long-video models, delivering up to 67.9% reduction while outperforming the InternVL baseline on VideoMME, MLVU, and LongVideoBench.
- From Priors to Perception: Grounding Video-LLMs in Physical Reality
  Video-LLMs fail physical reasoning due to semantic prior dominance rather than perception deficits; a new programmatic adversarial curriculum and visual-anchored reasoning chain enable substantial gains via standard L...
- WindowQuant: Mixed-Precision KV Cache Quantization based on Window-Level Similarity for VLMs Inference Optimization
  WindowQuant performs window-adaptive mixed-precision KV cache quantization guided by similarity to the text prompt, with reordering to enable efficient inference in VLMs.
- Beyond Perceptual Shortcuts: Causal-Inspired Debiasing Optimization for Generalizable Video Reasoning in Lightweight MLLMs
  VideoThinker improves lightweight MLLM video reasoning by creating a bias model to capture shortcuts and applying causal debiasing policy optimization to push away from them, achieving SOTA efficiency with minimal data.
- EmoMM: Benchmarking and Steering MLLM for Multimodal Emotion Recognition under Conflict and Missingness
  EmoMM benchmark reveals Video Contribution Collapse in MLLMs for emotion recognition under modality conflict and missingness, mitigated by CHASE head-level attention steering.
- DenseStep2M: A Scalable, Training-Free Pipeline for Dense Instructional Video Annotation
  A scalable training-free pipeline using video segmentation, filtering, and off-the-shelf multimodal models creates DenseStep2M, a dataset of 100K videos and 2M detailed instructional steps that improves dense captioni...
- Exploring Audio Hallucination in Egocentric Video Understanding
  AV-LLMs hallucinate audio from visuals in egocentric videos, scoring only 27.3% accuracy on foreground sounds and 39.5% on background sounds in a 1000-question evaluation.
- Video-ToC: Video Tree-of-Cue Reasoning
  Video-ToC adds tree-guided cue localization, demand-based RL rewards, and automated datasets to video LLMs, reporting better results than prior methods on six understanding benchmarks plus a hallucination test.
- AVRT: Audio-Visual Reasoning Transfer through Single-Modality Teachers
  AVRT transfers reasoning to audio-visual models by distilling traces from single-modality teachers via LLM merger followed by SFT cold-start and RL, achieving SOTA on OmniBench, DailyOmni, and MMAR with 3B/7B models.
- RaTA-Tool: Retrieval-based Tool Selection with Multimodal Large Language Models
  RaTA-Tool retrieves suitable external tools for multimodal queries by matching generated task descriptions against tool metadata, supported by a new Hugging Face-derived dataset and DPO optimization.
- One Token per Highly Selective Frame: Towards Extreme Compression for Long Video Understanding
  XComp reaches extreme video compression (one token per selective frame) via learnable progressive token compression and question-conditioned frame selection, lifting LVBench accuracy from 42.9 percent to 46.2 percent ...
- Relaxing Anchor-Frame Dominance for Mitigating Hallucinations in Video Large Language Models
  Decoder-side Temporal Rebalancing (DTR) reduces hallucinations in Video-LLMs by mitigating over-dominance of a single anchor frame during inference without training or auxiliary models.
- See Fair, Speak Truth: Equitable Attention Improves Grounding and Reduces Hallucination in Vision-Language Alignment
  Equitable attention via Dominant Object Penalty and Outlier Boost Coefficient reduces object hallucinations in multimodal LLMs without retraining.
- Reinforce to Learn, Elect to Reason: A Dual Paradigm for Video Reasoning
  RLER trains video-reasoning models with three task-driven RL rewards for evidence production and elects the best answer from a few candidates via evidence consistency scoring, yielding 6.3% average gains on eight benchmarks.
- Graph-to-Frame RAG: Visual-Space Knowledge Fusion for Training-Free and Auditable Video Reasoning
  G2F-RAG converts retrieved knowledge subgraphs into a single visual reasoning frame appended to videos, enabling training-free and interpretable improvements for LMM-based video reasoning on knowledge-intensive tasks.
- STEAR: Layer-Aware Spatiotemporal Evidence Intervention for Hallucination Mitigation in Video Large Language Models
  STEAR reduces spatial and temporal hallucinations in Video-LLMs via layer-aware evidence intervention from middle decoder layers in a single-encode pass.
- STRIVE: Structured Spatiotemporal Exploration for Reinforcement Learning in Video Question Answering
  STRIVE stabilizes RL for video QA by creating spatiotemporal video variants and using importance-aware sampling, yielding consistent gains over baselines on six benchmarks.
- InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency
  InternVL3.5 advances open-source multimodal models with Cascade RL for +16% reasoning gains and ViR for 4x inference speedup, with the 241B model reaching SOTA among open-source MLLMs on multimodal, reasoning, and age...
- InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models
  InternVL3-78B sets a new open-source SOTA of 72.2 on MMMU via native joint multimodal pre-training, V2PE, MPO, and test-time scaling while remaining competitive with proprietary models.
- Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling
  InternVL 2.5 is the first open-source MLLM to surpass 70% on the MMMU benchmark via model, data, and test-time scaling, with a 3.7-point gain from chain-of-thought reasoning.
- LLaVA-Video: Video Instruction Tuning With Synthetic Data
  LLaVA-Video-178K is a new synthetic video instruction dataset that, when combined with existing data to train LLaVA-Video, produces strong results on video understanding benchmarks.
- OmniRefine: Alignment-Aware Cooperative Compression for Efficient Omnimodal Large Language Models
  OmniRefine introduces alignment-aware chunk refinement via similarity and dynamic programming followed by modality-cooperative token compression, achieving near-baseline accuracy at 44% token retention on WorldSense.
- Response-G1: Explicit Scene Graph Modeling for Proactive Streaming Video Understanding
  Response-G1 uses query-guided scene graph generation, memory retrieval, and retrieval-augmented prompting to improve proactive response timing in streaming video understanding.
- Learning Invariant Modality Representation for Robust Multimodal Learning from a Causal Inference Perspective
  CmIR uses causal inference to separate invariant causal representations from spurious ones in multimodal data, improving generalization under distribution shifts and noise via invariance, mutual information, and recon...
- AffectAgent: Collaborative Multi-Agent Reasoning for Retrieval-Augmented Multimodal Emotion Recognition
  AffectAgent deploys a query planner, evidence filter, and emotion generator as collaborative agents trained via MAPPO with shared reward, plus MB-MoE and RAAF modules, to achieve superior multimodal emotion recognitio...
- Kimi-Audio Technical Report
  Kimi-Audio is an open-source audio foundation model that achieves state-of-the-art results on speech recognition, audio understanding, question answering, and conversation after pre-training on more than 13 million ho...
- Empowering Video Translation using Multimodal Large Language Models
  The paper offers the first focused review of MLLM-based video translation organized by a three-role taxonomy of Semantic Reasoner, Expressive Performer, and Visual Synthesizer, plus open challenges.
- VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding
  VideoLLaMA3 uses a vision-centric training paradigm and token-reduction design to reach competitive results on image and video benchmarks.