VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs
53 Pith papers cite this work. Polarity classification is still indexing.
abstract
In this paper, we present VideoLLaMA 2, a set of Video Large Language Models (Video-LLMs) designed to enhance spatial-temporal modeling and audio understanding in video- and audio-oriented tasks. Building upon its predecessor, VideoLLaMA 2 incorporates a tailor-made Spatial-Temporal Convolution (STC) connector, which effectively captures the intricate spatial and temporal dynamics of video data. Additionally, we integrate an Audio Branch into the model through joint training, enriching its multimodal understanding by seamlessly incorporating audio cues. Comprehensive evaluations on multiple-choice video question answering (MC-VQA), open-ended video question answering (OE-VQA), and video captioning (VC) tasks demonstrate that VideoLLaMA 2 consistently achieves competitive results among open-source models and approaches some proprietary models on several benchmarks. Furthermore, VideoLLaMA 2 exhibits reasonable improvements on audio-only and audio-video question-answering (AQA & OE-AVQA) benchmarks over existing models. These advancements underline VideoLLaMA 2's superior performance in multimodal comprehension, setting a new standard for intelligent video analysis systems. All models are publicly available to facilitate further research.
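The STC connector's role, shrinking the grid of frame-patch tokens along both space and time before the language model sees it, can be illustrated with a minimal sketch. Average pooling stands in for the paper's 3D convolutions here; the `stc_downsample` helper and all shapes are illustrative assumptions, not the model's actual layers:

```python
import numpy as np

def stc_downsample(tokens, t, h, w, kt=2, kh=2, kw=2):
    """Illustrative spatial-temporal downsampling: average-pool a
    (t*h*w, d) grid of frame-patch tokens over a (kt, kh, kw) window,
    shrinking the token count by kt*kh*kw before the LLM sees it.
    Hypothetical stand-in for the STC connector, which uses 3D
    convolutions rather than pooling; shapes are assumptions."""
    d = tokens.shape[1]
    grid = tokens.reshape(t, h, w, d)
    grid = grid.reshape(t // kt, kt, h // kh, kh, w // kw, kw, d)
    return grid.mean(axis=(1, 3, 5)).reshape(-1, d)

# 8 frames of 16x16 patches with 64-dim features -> 2048 tokens in
out = stc_downsample(np.random.rand(8 * 16 * 16, 64), t=8, h=16, w=16)
print(out.shape)  # (256, 64): an 8x reduction in token count
```

The point of the sketch is only the bookkeeping: whatever the learned operator is, the connector trades a dense spatiotemporal token grid for a much smaller one with the same feature width.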
citing papers explorer
-
TraceAV-Bench: Benchmarking Multi-Hop Trajectory Reasoning over Long Audio-Visual Videos
TraceAV-Bench is the first benchmark for multi-hop trajectory reasoning over long audio-visual videos, showing top models reach only 51-68% accuracy with substantial room for improvement.
-
When Text Hijacks Vision: Benchmarking and Mitigating Text Overlay-Induced Hallucination in Vision Language Models
VLMs hallucinate by prioritizing contradictory on-screen text over visual content, addressed via the VisualTextTrap benchmark with 6,057 human-validated samples and the VTHM-MoE dual-encoder framework using dimension-specific experts and adaptive routing.
-
ReTool-Video: Recursive Tool-Using Video Agents with Meta-Augmented Tool Grounding
ReTool-Video uses a 134-tool meta-augmented library and recursive grounding to translate abstract video intents into fine-grained multimodal operations, outperforming baselines on MVBench, MLVU, and Video-MME.
-
AdaFocus: Adaptive Relevance-Diversity Sampling with Zero-Cache Look-back for Efficient Long Video Understanding
AdaFocus achieves better accuracy on long-video benchmarks with roughly 33 times fewer visual tokens by combining query-aware adaptive sampling and zero-cache disk-based refinement.
-
MMVIAD: Multi-view Multi-task Video Understanding for Industrial Anomaly Detection
MMVIAD is the first multi-view continuous video dataset for industrial anomaly detection with four supported tasks, and the VISTA model improves average benchmark scores from 45.0 to 57.5 on unseen data while surpassing GPT-5.4.
-
TOC-Bench: A Temporal Object Consistency Benchmark for Video Large Language Models
TOC-Bench is a new diagnostic benchmark that reveals major weaknesses in temporal object consistency for Video-LLMs, including event counting, ordering, identity reasoning, and hallucination avoidance.
-
Tracing the Arrow of Time: Diagnosing Temporal Information Flow in Video-LLMs
Temporal information in Video-LLMs is encoded well by video-centric encoders but disrupted by standard projectors; time-preserved MLPs plus AoT supervision yield 98.1% accuracy on arrow-of-time and gains on other temporal tasks.
-
Membership Inference Attacks Against Video Large Language Models
A temperature-perturbed black-box attack infers video training membership in VideoLLMs with 0.68 AUC by exploiting sharper generation behavior on member samples.
-
GaLa: Hypergraph-Guided Visual Language Models for Procedural Planning
GaLa uses hypergraph representations of objects and a TriView encoder with contrastive learning to improve vision-language models on procedural planning benchmarks.
-
Watching Movies Like a Human: Egocentric Emotion Understanding for Embodied Companions
Introduces the first egocentric screen-view movie emotion benchmark and demonstrates that cinematic models' Macro-F1 drops sharply under realistic robot-like viewing conditions, while domain-specific training improves robustness.
-
Chain of Modality: From Static Fusion to Dynamic Orchestration in Omni-MLLMs
Chain of Modality dynamically orchestrates multimodal input topologies and bifurcates cognitive execution to overcome static fusion biases in Omni-MLLMs.
-
Don't Let the Video Speak: Audio-Contrastive Preference Optimization for Audio-Visual Language Models
Audio-Contrastive Preference Optimization (ACPO) mitigates audio hallucination in AVLMs via output-contrastive and input-contrastive objectives that enforce faithful audio grounding.
-
BoxTuning: Directly Injecting the Object Box for Multimodal Model Fine-Tuning
By drawing object boxes and motion trails visually on video frames instead of serializing coordinates as text, BoxTuning reduces token costs dramatically and improves accuracy on video question answering benchmarks.
-
AdaSpark: Adaptive Sparsity for Efficient Long-Video Understanding
AdaSpark delivers up to 57% FLOP reduction in Video-LLMs for long videos through adaptive cube- and token-level sparsity without apparent loss in performance on hour-scale benchmarks.
-
Bridging Time and Space: Decoupled Spatio-Temporal Alignment for Video Grounding
Bridge-STG decouples spatio-temporal alignment via semantic bridging and query-guided localization modules to achieve state-of-the-art m_vIoU of 34.3 on VidSTG among MLLM methods.
-
SVAgent: Storyline-Guided Long Video Understanding via Cross-Modal Multi-Agent Collaboration
SVAgent improves long video question answering by constructing storylines via multi-agent collaboration and aligning cross-modal predictions for more robust, human-like reasoning.
-
Seeing the Scene Matters: Revealing Forgetting in Video Understanding Models with a Scene-Aware Long-Video Benchmark
SceneBench shows VLMs lose accuracy on scene-level questions in long videos due to forgetting, and Scene-RAG retrieval improves performance by 2.5%.
-
Video-R1: Reinforcing Video Reasoning in MLLMs
Video-R1 uses temporal-aware RL and mixed datasets to boost video reasoning in MLLMs, with a 7B model reaching 37.1% on VSI-Bench and surpassing GPT-4o.
-
MLVU: Benchmarking Multi-task Long Video Understanding
MLVU is a new benchmark for long video understanding that uses extended videos across diverse genres and multi-task evaluations, revealing that current MLLMs struggle significantly and degrade sharply with longer durations.
-
VideoSEAL: Mitigating Evidence Misalignment in Agentic Long Video Understanding by Decoupling Answer Authority
Decoupling planning from answer authority in long-video agents reduces evidence misalignment and raises accuracy to 55.1% on LVBench and 62.0% on LongVideoBench.
-
Keep What Audio Cannot Say: Context-Preserving Token Pruning for Omni-LLMs
ContextGuard prunes 55% of tokens in Qwen2.5-Omni 7B while matching full performance on five of six audio-visual benchmarks by preserving audio-irrecoverable visual context.
-
TB-AVA: Text as a Semantic Bridge for Audio-Visual Parameter Efficient Finetuning
TB-AVA uses text-mediated gated semantic modulation to enable efficient audio-visual alignment, achieving state-of-the-art results on AVE, AVS, and AVVP benchmarks.
-
Probing Cross-modal Information Hubs in Audio-Visual LLMs
AVLLMs store integrated audio-visual information mainly in a distinct subset of sink tokens called cross-modal sink tokens, which can be leveraged for training-free hallucination mitigation.
-
Separate First, Fuse Later: Mitigating Cross-Modal Interference in Audio-Visual LLMs Reasoning with Modality-Specific Chain-of-Thought
Separate modality-specific reasoning before fusion reduces hallucinations and improves accuracy in audio-visual LLMs by enforcing isolated traces then integrating evidence.
-
Response-G1: Explicit Scene Graph Modeling for Proactive Streaming Video Understanding
Response-G1 uses query-guided scene graphs, memory retrieval, and augmented prompting to improve when Video-LLMs decide to respond during streaming videos.
-
WeatherSyn: An Instruction Tuning MLLM For Weather Forecasting Report Generation
WeatherSyn is the first instruction-tuned MLLM for weather forecasting report generation, outperforming closed-source models on a new dataset of 31 US cities across 8 weather aspects.
-
VideoRouter: Query-Adaptive Dual Routing for Efficient Long-Video Understanding
VideoRouter uses dual semantic and image routers for query-adaptive token compression in long-video models, delivering up to 67.9% reduction while outperforming the InternVL baseline on VideoMME, MLVU, and LongVideoBench.
-
From Priors to Perception: Grounding Video-LLMs in Physical Reality
Video-LLMs fail physical reasoning due to semantic prior dominance rather than perception deficits; a new programmatic adversarial curriculum and visual-anchored reasoning chain enable substantial gains via standard LoRA fine-tuning.
-
WindowQuant: Mixed-Precision KV Cache Quantization based on Window-Level Similarity for VLMs Inference Optimization
WindowQuant performs window-adaptive mixed-precision KV cache quantization guided by similarity to the text prompt, with reordering to enable efficient inference in VLMs.
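The window-level idea in the WindowQuant summary, scoring contiguous windows of cached keys by similarity to the text prompt and spending more bits on the relevant ones, can be sketched as follows. This is a hypothetical illustration of the general technique, not the paper's code; the `quantize_kv_windows` helper, the min-max uniform quantizer, and every parameter are assumptions:

```python
import numpy as np

def quantize_kv_windows(keys, prompt_vec, window=4,
                        hi_bits=8, lo_bits=4, keep_frac=0.5):
    """Illustrative window-level mixed-precision KV quantization:
    score each window of cached key vectors by its mean cosine
    similarity to a text-prompt embedding, keep the top `keep_frac`
    of windows at hi_bits, and quantize the rest at lo_bits with
    per-window min-max uniform quantization. Hypothetical sketch."""
    n, d = keys.shape
    nw = n // window
    wins = keys[: nw * window].reshape(nw, window, d)
    p = prompt_vec / (np.linalg.norm(prompt_vec) + 1e-8)
    sims = (wins @ p / (np.linalg.norm(wins, axis=-1) + 1e-8)).mean(axis=1)
    hi = sims >= np.quantile(sims, 1 - keep_frac)  # prompt-similar windows
    out = np.empty_like(wins)
    for i, w in enumerate(wins):
        levels = 2 ** (hi_bits if hi[i] else lo_bits) - 1
        lo_v, hi_v = w.min(), w.max()
        scale = max((hi_v - lo_v) / levels, 1e-12)
        out[i] = np.round((w - lo_v) / scale) * scale + lo_v  # dequantized
    return out.reshape(-1, d), hi
```

The design choice being illustrated is that precision is allocated per window rather than per tensor, so prompt-relevant regions of the cache keep more reconstruction fidelity at the same average bit budget.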
-
Beyond Perceptual Shortcuts: Causal-Inspired Debiasing Optimization for Generalizable Video Reasoning in Lightweight MLLMs
VideoThinker improves lightweight MLLM video reasoning by creating a bias model to capture shortcuts and applying causal debiasing policy optimization to push away from them, achieving SOTA efficiency with minimal data.
-
EmoMM: Benchmarking and Steering MLLM for Multimodal Emotion Recognition under Conflict and Missingness
EmoMM benchmark reveals Video Contribution Collapse in MLLMs for emotion recognition under modality conflict and missingness, mitigated by CHASE head-level attention steering.
-
DenseStep2M: A Scalable, Training-Free Pipeline for Dense Instructional Video Annotation
A scalable training-free pipeline using video segmentation, filtering, and off-the-shelf multimodal models creates DenseStep2M, a dataset of 100K videos and 2M detailed instructional steps that improves dense captioning, step grounding, and cross-modal retrieval.
-
Exploring Audio Hallucination in Egocentric Video Understanding
AV-LLMs hallucinate audio from visuals in egocentric videos, scoring only 27.3% accuracy on foreground sounds and 39.5% on background sounds in a 1000-question evaluation.
-
Video-ToC: Video Tree-of-Cue Reasoning
Video-ToC adds tree-guided cue localization, demand-based RL rewards, and automated datasets to video LLMs, reporting better results than prior methods on six understanding benchmarks plus a hallucination test.
-
AVRT: Audio-Visual Reasoning Transfer through Single-Modality Teachers
AVRT transfers reasoning to audio-visual models by distilling traces from single-modality teachers via LLM merger followed by SFT cold-start and RL, achieving SOTA on OmniBench, DailyOmni, and MMAR with 3B/7B models.
-
RaTA-Tool: Retrieval-based Tool Selection with Multimodal Large Language Models
RaTA-Tool retrieves suitable external tools for multimodal queries by matching generated task descriptions against tool metadata, supported by a new Hugging Face-derived dataset and DPO optimization.
-
One Token per Highly Selective Frame: Towards Extreme Compression for Long Video Understanding
XComp reaches extreme video compression (one token per selective frame) via learnable progressive token compression and question-conditioned frame selection, lifting LVBench accuracy from 42.9% to 46.2% after tuning on 2.5% of the standard data.
-
Relaxing Anchor-Frame Dominance for Mitigating Hallucinations in Video Large Language Models
Decoder-side Temporal Rebalancing (DTR) reduces hallucinations in Video-LLMs by mitigating over-dominance of a single anchor frame during inference without training or auxiliary models.
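The rebalancing idea in the DTR summary, preventing one anchor frame from soaking up most of the attention mass at inference time, can be sketched with a simple cap-and-redistribute rule. This is an illustrative assumption about the general mechanism, not the paper's method, which operates inside the decoder's attention; the `rebalance_frame_attention` helper and the cap value are hypothetical:

```python
import numpy as np

def rebalance_frame_attention(attn, cap=0.4):
    """Illustrative temporal rebalancing: normalize per-frame attention
    weights, clip any frame whose share exceeds `cap`, and redistribute
    the excess proportionally across the remaining frames so no single
    anchor frame dominates. Hypothetical sketch; a single pass may leave
    a redistributed frame slightly above the cap."""
    attn = np.asarray(attn, dtype=float)
    attn = attn / attn.sum()
    over = attn > cap
    if not over.any():
        return attn
    excess = (attn[over] - cap).sum()
    attn[over] = cap
    rest = ~over
    attn[rest] += excess * attn[rest] / attn[rest].sum()
    return attn

print(rebalance_frame_attention([0.7, 0.1, 0.1, 0.1]))
# [0.4 0.2 0.2 0.2]
```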
-
See Fair, Speak Truth: Equitable Attention Improves Grounding and Reduces Hallucination in Vision-Language Alignment
Equitable attention via Dominant Object Penalty and Outlier Boost Coefficient reduces object hallucinations in multimodal LLMs without retraining.
-
Reinforce to Learn, Elect to Reason: A Dual Paradigm for Video Reasoning
RLER trains video-reasoning models with three task-driven RL rewards for evidence production and elects the best answer from a few candidates via evidence consistency scoring, yielding 6.3% average gains on eight benchmarks.
-
Graph-to-Frame RAG: Visual-Space Knowledge Fusion for Training-Free and Auditable Video Reasoning
G2F-RAG converts retrieved knowledge subgraphs into a single visual reasoning frame appended to videos, enabling training-free and interpretable improvements for LMM-based video reasoning on knowledge-intensive tasks.
-
STEAR: Layer-Aware Spatiotemporal Evidence Intervention for Hallucination Mitigation in Video Large Language Models
STEAR reduces spatial and temporal hallucinations in Video-LLMs via layer-aware evidence intervention from middle decoder layers in a single-encode pass.
-
STRIVE: Structured Spatiotemporal Exploration for Reinforcement Learning in Video Question Answering
STRIVE stabilizes RL for video QA by creating spatiotemporal video variants and using importance-aware sampling, yielding consistent gains over baselines on six benchmarks.
-
InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency
InternVL3.5 advances open-source multimodal models with Cascade RL for +16% reasoning gains and ViR for 4x inference speedup, with the 241B model reaching SOTA among open-source MLLMs on multimodal, reasoning, and agentic tasks.
-
InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models
InternVL3-78B sets a new open-source SOTA of 72.2 on MMMU via native joint multimodal pre-training, V2PE, MPO, and test-time scaling while remaining competitive with proprietary models.
-
Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling
InternVL 2.5 is the first open-source MLLM to surpass 70% on the MMMU benchmark via model, data, and test-time scaling, with a 3.7-point gain from chain-of-thought reasoning.
-
LLaVA-Video: Video Instruction Tuning With Synthetic Data
LLaVA-Video-178K is a new synthetic video instruction dataset that, when combined with existing data to train LLaVA-Video, produces strong results on video understanding benchmarks.
-
OmniRefine: Alignment-Aware Cooperative Compression for Efficient Omnimodal Large Language Models
OmniRefine introduces alignment-aware chunk refinement via similarity and dynamic programming followed by modality-cooperative token compression, achieving near-baseline accuracy at 44% token retention on WorldSense.
-
Learning Invariant Modality Representation for Robust Multimodal Learning from a Causal Inference Perspective
CmIR uses causal inference to separate invariant causal representations from spurious ones in multimodal data, improving generalization under distribution shifts and noise via invariance, mutual information, and reconstruction constraints.
-
AffectAgent: Collaborative Multi-Agent Reasoning for Retrieval-Augmented Multimodal Emotion Recognition
AffectAgent deploys a query planner, evidence filter, and emotion generator as collaborative agents trained via MAPPO with shared reward, plus MB-MoE and RAAF modules, to achieve superior multimodal emotion recognition on MER-UniBench.