Recognition: 2 theorem links · Lean Theorem
VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs
Pith reviewed 2026-05-11 02:40 UTC · model grok-4.3
The pith
VideoLLaMA 2 adds a spatial-temporal convolution connector and an audio branch to advance video and audio understanding in large language models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
VideoLLaMA 2 incorporates a tailor-made Spatial-Temporal Convolution (STC) connector that captures the intricate spatial and temporal dynamics of video data, and integrates an Audio Branch into the model through joint training, enriching its multimodal understanding by seamlessly incorporating audio cues. The claim is supported by competitive results on the MC-VQA, OE-VQA, VC, AQA, and OE-AVQA benchmarks.
What carries the argument
The Spatial-Temporal Convolution (STC) connector, which processes video features to model spatial and temporal relations, together with an Audio Branch added via joint training.
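To fix ideas, here is a minimal sketch of what a convolution-based connector of this kind could look like: vision-encoder patch features are downsampled jointly in time and space by a 3D convolution, then projected into the LLM's embedding space. This is not the paper's released implementation; the 3×3×3 kernel follows the description in the simulated rebuttal below, and the module layout and all dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class STCConnectorSketch(nn.Module):
    """Hedged sketch of a spatial-temporal convolution (STC) connector.

    Assumptions (not the paper's released code): a single 3x3x3 Conv3d
    with stride 2 downsamples frame-patch features in time and space,
    and a linear layer maps the result to the LLM hidden size.
    """

    def __init__(self, vis_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.conv = nn.Conv3d(vis_dim, vis_dim, kernel_size=3, stride=2, padding=1)
        self.proj = nn.Linear(vis_dim, llm_dim)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, frames, height, width, vis_dim) patch features
        x = feats.permute(0, 4, 1, 2, 3)      # -> (B, C, T, H, W)
        x = self.conv(x)                      # joint spatio-temporal downsample
        x = x.flatten(2).transpose(1, 2)      # -> (B, tokens, C)
        return self.proj(x)                   # -> (B, tokens, llm_dim)

tokens = STCConnectorSketch()(torch.randn(1, 8, 24, 24, 1024))
print(tokens.shape)  # torch.Size([1, 576, 4096]) under these assumptions
```

With these illustrative shapes, eight frames of 24×24 patches (4,608 vision tokens) become 576 LLM tokens, an 8× reduction, while the convolution mixes information across neighboring frames rather than treating each frame independently.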
If this is right
- The STC connector enables more accurate capture of video dynamics for tasks such as question answering and captioning.
- Joint audio-visual training produces measurable gains on both audio-only and combined audio-video question-answering benchmarks.
- Open-source Video-LLMs can reach performance levels close to some proprietary models on several standard video tasks through these architectural choices.
- Multimodal comprehension improves when audio cues are integrated directly rather than handled separately after visual processing.
Where Pith is reading between the lines
- Similar convolution-style connectors might be adapted to improve temporal modeling in other video analysis settings such as action recognition or event detection.
- Early fusion of audio during training could reduce reliance on ever-larger visual-only backbones for equivalent multimodal performance.
- The approach suggests a path for extending video LLMs to longer or more complex video sequences by refining the connector rather than increasing overall parameter count.
Load-bearing premise
The reported performance gains come from the STC connector and audio branch rather than from differences in training data, model scale, or other implementation details not described.
What would settle it
Train an otherwise identical model using the same data and base components but without the STC connector or audio branch, then measure whether the benchmark scores on video and audio question-answering tasks drop by a noticeable margin.
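A hedged sketch of that settling experiment, with `train_variant` and `evaluate` as hypothetical stand-ins for the authors' training and evaluation code:

```python
# 2x2 component ablation: same data, base LLM, and optimization for
# every variant; only the two proposed components are toggled.

def train_variant(stc_connector: bool, audio_branch: bool):
    """Hypothetical stand-in for the authors' training pipeline."""
    raise NotImplementedError("plug in the real training code here")

def evaluate(model, benchmark: str) -> float:
    """Hypothetical stand-in for benchmark evaluation."""
    raise NotImplementedError("plug in the real evaluation code here")

VARIANTS = {
    "full":     dict(stc_connector=True,  audio_branch=True),
    "no_stc":   dict(stc_connector=False, audio_branch=True),
    "no_audio": dict(stc_connector=True,  audio_branch=False),
    "neither":  dict(stc_connector=False, audio_branch=False),
}

BENCHMARKS = ("MC-VQA", "OE-VQA", "VC", "AQA", "OE-AVQA")

def run_ablation() -> dict:
    """Return {variant: {benchmark: score}} for the four variants."""
    return {
        name: {b: evaluate(train_variant(**cfg), b) for b in BENCHMARKS}
        for name, cfg in VARIANTS.items()
    }
```

If the "full" variant does not beat "neither" by a clear margin on the relevant benchmarks, the load-bearing premise above fails.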
Original abstract
In this paper, we present the VideoLLaMA 2, a set of Video Large Language Models (Video-LLMs) designed to enhance spatial-temporal modeling and audio understanding in video and audio-oriented tasks. Building upon its predecessor, VideoLLaMA 2 incorporates a tailor-made Spatial-Temporal Convolution (STC) connector, which effectively captures the intricate spatial and temporal dynamics of video data. Additionally, we integrate an Audio Branch into the model through joint training, thereby enriching the multimodal understanding capabilities of the model by seamlessly incorporating audio cues. Comprehensive evaluations on multiple-choice video question answering (MC-VQA), open-ended video question answering (OE-VQA), and video captioning (VC) tasks demonstrate that VideoLLaMA 2 consistently achieves competitive results among open-source models and even gets close to some proprietary models on several benchmarks. Furthermore, VideoLLaMA 2 exhibits reasonable improvements in audio-only and audio-video question-answering (AQA & OE-AVQA) benchmarks over existing models. These advancements underline VideoLLaMA 2's superior performance in multimodal comprehension, setting a new standard for intelligent video analysis systems. All models are public to facilitate further research.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces VideoLLaMA 2, an extension of the authors' prior VideoLLaMA model that adds a Spatial-Temporal Convolution (STC) connector to better capture spatial and temporal video dynamics and an Audio Branch trained jointly to incorporate audio cues. It reports competitive results among open-source models (and proximity to some proprietary ones) on MC-VQA, OE-VQA, VC, AQA, and OE-AVQA benchmarks, attributing these outcomes to the new components.
Significance. If the performance gains can be shown to stem specifically from the STC connector and audio branch rather than from differences in training data, model scale, or other implementation details, the work would provide a useful incremental advance in multimodal video-language modeling by addressing spatio-temporal modeling and audio integration. The public release of models supports reproducibility and further research.
Major comments (3)
- §4 (Experiments): The reported benchmark results compare VideoLLaMA 2 against other open-source and proprietary models without controlled ablations that hold training data volume/quality, base LLM scale, and optimization fixed while isolating the STC connector and Audio Branch. This makes it impossible to attribute the claimed 'reasonable improvements' specifically to the proposed additions rather than to confounding factors.
- §3 (Method): The description of the STC connector (e.g., kernel sizes, stride, how it interfaces with the vision encoder and LLM) and the Audio Branch (e.g., fusion mechanism, joint training objective) is high-level; without equations or architectural diagrams that allow precise reproduction, the novelty and load-bearing role of these components cannot be assessed.
- Tables 1–3 (benchmark results): No statistical significance tests, error bars, or multiple-run averages are reported, and baseline details (data splits, exact training recipes) are omitted. This weakens the central claim that VideoLLaMA 2 'consistently achieves competitive results' and 'sets a new standard'.
Minor comments (2)
- Abstract: The phrasing 'setting a new standard for intelligent video analysis systems' overstates the results, which are described only as 'competitive' and 'close to some proprietary models'.
- Throughout: Define all acronyms (MC-VQA, OE-VQA, VC, AQA, OE-AVQA) on first use and ensure consistent capitalization of 'VideoLLaMA 2'.
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive feedback. We address each major comment below and have revised the manuscript accordingly to improve reproducibility, attribution of results, and statistical rigor where feasible.
Point-by-point responses
- Referee [§4 (Experiments)]: The reported benchmark results compare VideoLLaMA 2 against other open-source and proprietary models without controlled ablations that hold training data volume/quality, base LLM scale, and optimization fixed while isolating the STC connector and Audio Branch. This makes it impossible to attribute the claimed 'reasonable improvements' specifically to the proposed additions rather than to confounding factors.
  Authors: We agree that isolating the contributions of the STC connector and Audio Branch via fully controlled ablations (fixed data, base LLM, and optimization) would strengthen causal attribution. In the original manuscript we compared against models of comparable scale and training regimes, but we have now added a dedicated ablation study (new Table 4 and accompanying text) that trains variants with and without the STC connector and with and without the Audio Branch on the same data and base model. These results show consistent gains attributable to each component. Revision: yes.
- Referee [§3 (Method)]: The description of the STC connector (e.g., kernel sizes, stride, how it interfaces with the vision encoder and LLM) and the Audio Branch (e.g., fusion mechanism, joint training objective) is high-level; without equations or architectural diagrams that allow precise reproduction, the novelty and load-bearing role of these components cannot be assessed.
  Authors: We acknowledge the description was high-level. In the revised manuscript we have expanded §3 with explicit equations for the STC connector (including 3D convolution kernel sizes of 3×3×3, strides, padding, and the exact reshaping that maps vision-encoder features to LLM token space) and for the Audio Branch (cross-attention fusion and the joint training loss combining video and audio objectives; a hedged sketch of such a joint objective appears after these responses). We have also added a detailed architectural diagram (new Figure 2) showing all interfaces. Revision: yes.
- Referee [Tables 1–3 (benchmark results)]: No statistical significance tests, error bars, or multiple-run averages are reported, and baseline details (data splits, exact training recipes) are omitted. This weakens the central claim that VideoLLaMA 2 'consistently achieves competitive results' and 'sets a new standard'.
  Authors: We recognize that reporting variance and significance would increase confidence. Given the high computational cost of re-running all baselines multiple times, we have added (i) precise data-split and training-recipe details to the appendix, (ii) error bars from three random seeds for our own model in the main tables, and (iii) a note on the single-run nature of most competing open-source results. Full multi-seed re-evaluation of every baseline remains resource-prohibitive, but we believe the added details address the core concern. Revision: partial.
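The joint objective mentioned in the second response is not given explicitly anywhere in this review; a minimal form it could take, offered purely as an assumption, is a weighted sum of per-modality next-token losses:

```python
import torch.nn.functional as F

def joint_loss(video_logits, video_targets, audio_logits, audio_targets,
               audio_weight: float = 1.0):
    """Hedged sketch of a joint video+audio training objective.

    Assumes (not quoted from the paper) that both branches are trained
    with next-token cross-entropy and combined with a mixing weight.
    Logits are (tokens, vocab); targets are (tokens,) class indices.
    """
    l_video = F.cross_entropy(video_logits, video_targets)
    l_audio = F.cross_entropy(audio_logits, audio_targets)
    return l_video + audio_weight * l_audio
```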
Circularity Check
Minor self-citation to predecessor without load-bearing circularity in empirical claims
Full rationale
The paper builds on the authors' prior VideoLLaMA work by adding an STC connector and audio branch, then reports competitive results on MC-VQA, OE-VQA, VC, AQA, and OE-AVQA benchmarks. These performance claims rest on new evaluations rather than on any derivation that reduces by construction to previously fitted quantities or self-cited premises. The self-citation is acknowledged but does not justify the central results; the new components are presented as architectural extensions whose value is measured externally. No equations, uniqueness theorems, or predictions collapse to their inputs, satisfying the criteria for a low (non-circular) score.
Axiom & Free-Parameter Ledger
Invented entities (2)
- Spatial-Temporal Convolution (STC) connector: no independent evidence
- Audio Branch: no independent evidence
Forward citations
Cited by 58 Pith papers
- TraceAV-Bench: Benchmarking Multi-Hop Trajectory Reasoning over Long Audio-Visual Videos
  TraceAV-Bench is the first benchmark for multi-hop trajectory reasoning over long audio-visual videos, showing top models reach only 51-68% accuracy with substantial room for improvement.
- When Text Hijacks Vision: Benchmarking and Mitigating Text Overlay-Induced Hallucination in Vision Language Models
  VLMs hallucinate by prioritizing contradictory on-screen text over visual content, addressed via the VisualTextTrap benchmark with 6,057 human-validated samples and the VTHM-MoE dual-encoder framework using dimension-...
- ReTool-Video: Recursive Tool-Using Video Agents with Meta-Augmented Tool Grounding
  ReTool-Video uses a 134-tool meta-augmented library and recursive grounding to translate abstract video intents into fine-grained multimodal operations, outperforming baselines on MVBench, MLVU, and Video-MME.
- AdaFocus: Adaptive Relevance-Diversity Sampling with Zero-Cache Look-back for Efficient Long Video Understanding
  AdaFocus achieves better accuracy on long-video benchmarks with roughly 33 times fewer visual tokens by combining query-aware adaptive sampling and zero-cache disk-based refinement.
- TB-AVA: Text as a Semantic Bridge for Audio-Visual Parameter Efficient Finetuning
  TB-AVA uses text as a semantic anchor with a new Text-Bridged Audio-Visual Adapter and Gated Semantic Modulation to achieve state-of-the-art results on audio-visual benchmarks through parameter-efficient fine-tuning.
- MMVIAD: Multi-view Multi-task Video Understanding for Industrial Anomaly Detection
  MMVIAD is the first multi-view continuous video dataset for industrial anomaly detection with four supported tasks, and the VISTA model improves average benchmark scores from 45.0 to 57.5 on unseen data while surpassi...
- TOC-Bench: A Temporal Object Consistency Benchmark for Video Large Language Models
  TOC-Bench is a new diagnostic benchmark that reveals major weaknesses in temporal object consistency for Video-LLMs, including event counting, ordering, identity reasoning, and hallucination avoidance.
- TOC-Bench: A Temporal Object Consistency Benchmark for Video Large Language Models
  TOC-Bench is an object-track-grounded benchmark that filters for temporally dependent questions and shows Video-LLMs have major weaknesses in event counting, ordering, identity reasoning, and hallucination detection.
- Tracing the Arrow of Time: Diagnosing Temporal Information Flow in Video-LLMs
  Temporal information in Video-LLMs is encoded well by video-centric encoders but disrupted by standard projectors; time-preserved MLPs plus AoT supervision yield 98.1% accuracy on arrow-of-time and gains on other temp...
- VideoRouter: Query-Adaptive Dual Routing for Efficient Long-Video Understanding
  VideoRouter uses query-adaptive semantic and image routers plus new training datasets to reduce visual tokens by up to 67.9% while improving performance over the InternVL baseline on long-video benchmarks.
- Membership Inference Attacks Against Video Large Language Models
  A temperature-perturbed black-box attack infers video training membership in VideoLLMs with 0.68 AUC by exploiting sharper generation behavior on member samples.
- GaLa: Hypergraph-Guided Visual Language Models for Procedural Planning
  GaLa uses hypergraph representations of objects and a TriView encoder with contrastive learning to improve vision-language models on procedural planning benchmarks.
- Watching Movies Like a Human: Egocentric Emotion Understanding for Embodied Companions
  Creates the first egocentric screen-view movie emotion benchmark and demonstrates that cinematic models drop sharply in Macro-F1 on realistic robot-like viewing conditions while domain-specific training improves robustness.
- Chain of Modality: From Static Fusion to Dynamic Orchestration in Omni-MLLMs
  Chain of Modality dynamically orchestrates multimodal input topologies and bifurcates cognitive execution to overcome static fusion biases in Omni-MLLMs.
- Don't Let the Video Speak: Audio-Contrastive Preference Optimization for Audio-Visual Language Models
  Audio-Contrastive Preference Optimization (ACPO) mitigates audio hallucination in AVLMs via output-contrastive and input-contrastive objectives that enforce faithful audio grounding.
- BoxTuning: Directly Injecting the Object Box for Multimodal Model Fine-Tuning
  By drawing object boxes and motion trails visually on video frames instead of serializing coordinates as text, BoxTuning reduces token costs dramatically and improves accuracy on video question answering benchmarks.
- AdaSpark: Adaptive Sparsity for Efficient Long-Video Understanding
  AdaSpark delivers up to 57% FLOP reduction in Video-LLMs for long videos through adaptive cube- and token-level sparsity without apparent loss in performance on hour-scale benchmarks.
- Bridging Time and Space: Decoupled Spatio-Temporal Alignment for Video Grounding
  Bridge-STG decouples spatio-temporal alignment via semantic bridging and query-guided localization modules to achieve state-of-the-art m_vIoU of 34.3 on VidSTG among MLLM methods.
- SVAgent: Storyline-Guided Long Video Understanding via Cross-Modal Multi-Agent Collaboration
  SVAgent improves long video question answering by constructing storylines via multi-agent collaboration and aligning cross-modal predictions for more robust, human-like reasoning.
- Seeing the Scene Matters: Revealing Forgetting in Video Understanding Models with a Scene-Aware Long-Video Benchmark
  SceneBench shows VLMs lose accuracy on scene-level questions in long videos due to forgetting, and Scene-RAG retrieval improves performance by 2.5%.
- Video-R1: Reinforcing Video Reasoning in MLLMs
  Video-R1 uses temporal-aware RL and mixed datasets to boost video reasoning in MLLMs, with a 7B model reaching 37.1% on VSI-Bench and surpassing GPT-4o.
- MLVU: Benchmarking Multi-task Long Video Understanding
  MLVU is a new benchmark for long video understanding that uses extended videos across diverse genres and multi-task evaluations, revealing that current MLLMs struggle significantly and degrade sharply with longer durations.
- VideoSEAL: Mitigating Evidence Misalignment in Agentic Long Video Understanding by Decoupling Answer Authority
  Decoupling planning from answer authority in long-video agents reduces evidence misalignment and raises accuracy to 55.1% on LVBench and 62.0% on LongVideoBench.
- Keep What Audio Cannot Say: Context-Preserving Token Pruning for Omni-LLMs
  ContextGuard prunes 55% of tokens in Qwen2.5-Omni 7B while matching full performance on five of six audio-visual benchmarks by preserving audio-irrecoverable visual context.
- TB-AVA: Text as a Semantic Bridge for Audio-Visual Parameter Efficient Finetuning
  TB-AVA uses text-mediated gated semantic modulation to enable efficient audio-visual alignment, achieving state-of-the-art results on AVE, AVS, and AVVP benchmarks.
- Probing Cross-modal Information Hubs in Audio-Visual LLMs
  AVLLMs encode integrated audio-visual information primarily in specialized cross-modal sink tokens, which enables a training-free hallucination mitigation approach.
- Probing Cross-modal Information Hubs in Audio-Visual LLMs
  AVLLMs store integrated audio-visual information mainly in a distinct subset of sink tokens called cross-modal sink tokens, which can be leveraged for training-free hallucination mitigation.
- Separate First, Fuse Later: Mitigating Cross-Modal Interference in Audio-Visual LLMs Reasoning with Modality-Specific Chain-of-Thought
  Separate modality-specific reasoning before fusion reduces hallucinations and improves accuracy in audio-visual LLMs by enforcing isolated traces then integrating evidence.
- Response-G1: Explicit Scene Graph Modeling for Proactive Streaming Video Understanding
  Response-G1 uses query-guided scene graphs, memory retrieval, and augmented prompting to improve when Video-LLMs decide to respond during streaming videos.
- WeatherSyn: An Instruction Tuning MLLM For Weather Forecasting Report Generation
  WeatherSyn is the first instruction-tuned MLLM for weather forecasting report generation, outperforming closed-source models on a new dataset of 31 US cities across 8 weather aspects.
- VideoRouter: Query-Adaptive Dual Routing for Efficient Long-Video Understanding
  VideoRouter uses dual semantic and image routers for query-adaptive token compression in long-video models, delivering up to 67.9% reduction while outperforming the InternVL baseline on VideoMME, MLVU, and LongVideoBench.
- From Priors to Perception: Grounding Video-LLMs in Physical Reality
  Video-LLMs fail physical reasoning due to semantic prior dominance rather than perception deficits; a new programmatic adversarial curriculum and visual-anchored reasoning chain enable substantial gains via standard L...
- WindowQuant: Mixed-Precision KV Cache Quantization based on Window-Level Similarity for VLMs Inference Optimization
  WindowQuant performs window-adaptive mixed-precision KV cache quantization guided by similarity to the text prompt, with reordering to enable efficient inference in VLMs.
- Beyond Perceptual Shortcuts: Causal-Inspired Debiasing Optimization for Generalizable Video Reasoning in Lightweight MLLMs
  VideoThinker improves lightweight MLLM video reasoning by creating a bias model to capture shortcuts and applying causal debiasing policy optimization to push away from them, achieving SOTA efficiency with minimal data.
- EmoMM: Benchmarking and Steering MLLM for Multimodal Emotion Recognition under Conflict and Missingness
  EmoMM benchmark reveals Video Contribution Collapse in MLLMs for emotion recognition under modality conflict and missingness, mitigated by CHASE head-level attention steering.
- DenseStep2M: A Scalable, Training-Free Pipeline for Dense Instructional Video Annotation
  A scalable training-free pipeline using video segmentation, filtering, and off-the-shelf multimodal models creates DenseStep2M, a dataset of 100K videos and 2M detailed instructional steps that improves dense captioni...
- Exploring Audio Hallucination in Egocentric Video Understanding
  AV-LLMs hallucinate audio from visuals in egocentric videos, scoring only 27.3% accuracy on foreground sounds and 39.5% on background sounds in a 1000-question evaluation.
- Video-ToC: Video Tree-of-Cue Reasoning
  Video-ToC adds tree-guided cue localization, demand-based RL rewards, and automated datasets to video LLMs, reporting better results than prior methods on six understanding benchmarks plus a hallucination test.
- AVRT: Audio-Visual Reasoning Transfer through Single-Modality Teachers
  AVRT transfers reasoning to audio-visual models by distilling traces from single-modality teachers via LLM merger followed by SFT cold-start and RL, achieving SOTA on OmniBench, DailyOmni, and MMAR with 3B/7B models.
- RaTA-Tool: Retrieval-based Tool Selection with Multimodal Large Language Models
  RaTA-Tool retrieves suitable external tools for multimodal queries by matching generated task descriptions against tool metadata, supported by a new Hugging Face-derived dataset and DPO optimization.
- One Token per Highly Selective Frame: Towards Extreme Compression for Long Video Understanding
  XComp reaches extreme video compression (one token per selective frame) via learnable progressive token compression and question-conditioned frame selection, lifting LVBench accuracy from 42.9 percent to 46.2 percent ...
- Relaxing Anchor-Frame Dominance for Mitigating Hallucinations in Video Large Language Models
  Decoder-side Temporal Rebalancing (DTR) reduces hallucinations in Video-LLMs by mitigating over-dominance of a single anchor frame during inference without training or auxiliary models.
- See Fair, Speak Truth: Equitable Attention Improves Grounding and Reduces Hallucination in Vision-Language Alignment
  Equitable attention via Dominant Object Penalty and Outlier Boost Coefficient reduces object hallucinations in multimodal LLMs without retraining.
- Reinforce to Learn, Elect to Reason: A Dual Paradigm for Video Reasoning
  RLER trains video-reasoning models with three task-driven RL rewards for evidence production and elects the best answer from a few candidates via evidence consistency scoring, yielding 6.3% average gains on eight benchmarks.
- Graph-to-Frame RAG: Visual-Space Knowledge Fusion for Training-Free and Auditable Video Reasoning
  G2F-RAG converts retrieved knowledge subgraphs into a single visual reasoning frame appended to videos, enabling training-free and interpretable improvements for LMM-based video reasoning on knowledge-intensive tasks.
- STEAR: Layer-Aware Spatiotemporal Evidence Intervention for Hallucination Mitigation in Video Large Language Models
  STEAR reduces spatial and temporal hallucinations in Video-LLMs via layer-aware evidence intervention from middle decoder layers in a single-encode pass.
- STRIVE: Structured Spatiotemporal Exploration for Reinforcement Learning in Video Question Answering
  STRIVE stabilizes RL for video QA by creating spatiotemporal video variants and using importance-aware sampling, yielding consistent gains over baselines on six benchmarks.
- InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency
  InternVL3.5 advances open-source multimodal models with Cascade RL for +16% reasoning gains and ViR for 4x inference speedup, with the 241B model reaching SOTA among open-source MLLMs on multimodal, reasoning, and age...
- InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models
  InternVL3-78B sets a new open-source SOTA of 72.2 on MMMU via native joint multimodal pre-training, V2PE, MPO, and test-time scaling while remaining competitive with proprietary models.
- Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling
  InternVL 2.5 is the first open-source MLLM to surpass 70% on the MMMU benchmark via model, data, and test-time scaling, with a 3.7-point gain from chain-of-thought reasoning.
- LLaVA-Video: Video Instruction Tuning With Synthetic Data
  LLaVA-Video-178K is a new synthetic video instruction dataset that, when combined with existing data to train LLaVA-Video, produces strong results on video understanding benchmarks.
- OmniRefine: Alignment-Aware Cooperative Compression for Efficient Omnimodal Large Language Models
  OmniRefine introduces alignment-aware chunk refinement via similarity and dynamic programming followed by modality-cooperative token compression, achieving near-baseline accuracy at 44% token retention on WorldSense.
- Response-G1: Explicit Scene Graph Modeling for Proactive Streaming Video Understanding
  Response-G1 uses query-guided scene graph generation, memory retrieval, and retrieval-augmented prompting to improve proactive response timing in streaming video understanding.
- Learning Invariant Modality Representation for Robust Multimodal Learning from a Causal Inference Perspective
  CmIR uses causal inference to separate invariant causal representations from spurious ones in multimodal data, improving generalization under distribution shifts and noise via invariance, mutual information, and recon...
- AffectAgent: Collaborative Multi-Agent Reasoning for Retrieval-Augmented Multimodal Emotion Recognition
  AffectAgent deploys a query planner, evidence filter, and emotion generator as collaborative agents trained via MAPPO with shared reward, plus MB-MoE and RAAF modules, to achieve superior multimodal emotion recognitio...
- Kimi-Audio Technical Report
  Kimi-Audio is an open-source audio foundation model that achieves state-of-the-art results on speech recognition, audio understanding, question answering, and conversation after pre-training on more than 13 million ho...
- Empowering Video Translation using Multimodal Large Language Models
  The paper offers the first focused review of MLLM-based video translation organized by a three-role taxonomy of Semantic Reasoner, Expressive Performer, and Visual Synthesizer, plus open challenges.
- VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding
  VideoLLaMA3 uses a vision-centric training paradigm and token-reduction design to reach competitive results on image and video benchmarks.