AVLLMs encode audio semantics in middle layers but suppress them in final text outputs when audio conflicts with vision, due to training that largely inherits from vision-language base models.
hub
video-salmonn: Speech-enhanced audio-visual large language models.arXiv preprint arXiv:2406.15704
15 Pith papers cite this work. Polarity classification is still indexing.
hub tools
citation-role summary
citation-polarity summary
roles
background 3polarities
background 3representative citing papers
Sandboxed coding agents with text+image access match or outperform native omnimodal models on audio-video benchmarks by converting tasks into code-driven retrieval and processing.
MTSS replaces monolithic video captions with factorized streams and relational grounding, yielding reported gains in understanding benchmarks and generation consistency.
AV-SpeakerBench is a new speaker-centered benchmark showing that top multimodal models still struggle with fine-grained audiovisual speech understanding, with Gemini 2.5 Pro leading but open models lagging on fusion.
XModBench is a tri-modal benchmark that systematically measures cross-modal consistency, modality disparities, and directional imbalances in omni-language models across five task families and all modality combinations.
R1-VL uses StepGRPO with rule-based StepRAR and StepRVR rewards to let MLLMs learn step-by-step reasoning beyond imitation of positive paths.
WorldSense provides the first benchmark requiring synergistic audio-video-text understanding on 1,662 real-world videos and 3,172 QA pairs, where the best current multimodal LLM reaches only 65.1% accuracy.
CapRiCorn-1K benchmark shows current video captioning models produce inaccurate and inconsistent captions that worsen with longer videos, with proposed metrics correlating to downstream task performance.
V-LynX integrates novel modalities into frozen Video LLMs by aligning to an internalized continuous token manifold using unpaired unimodal data and attention/statistical matching.
EmoMM benchmark reveals Video Contribution Collapse in MLLMs for emotion recognition under modality conflict and missingness, mitigated by CHASE head-level attention steering.
OmniZip introduces an audio-guided dynamic token compression framework that achieves 3.42X inference speedup and 1.4X memory reduction for omnimodal LLMs without any training.
OmniRefine introduces alignment-aware chunk refinement via similarity and dynamic programming followed by modality-cooperative token compression, achieving near-baseline accuracy at 44% token retention on WorldSense.
Qwen2.5-Omni presents a multimodal model with block-wise encoders, TMRoPE position embeddings, and a Thinker-Talker architecture that enables simultaneous text and streaming speech generation while matching text performance on reasoning benchmarks.
VideoLLaMA3 uses a vision-centric training paradigm and token-reduction design to reach competitive results on image and video benchmarks.
VITA-1.5 integrates vision and speech into a single LLM through multi-stage training, delivering competitive benchmark results on image, video, and speech tasks with near real-time response speed.
citing papers explorer
-
See, Hear, and Understand: Benchmarking Audiovisual Human Speech Understanding in Multimodal Large Language Models
AV-SpeakerBench is a new speaker-centered benchmark showing that top multimodal models still struggle with fine-grained audiovisual speech understanding, with Gemini 2.5 Pro leading but open models lagging on fusion.
-
XModBench: Benchmarking Cross-Modal Capabilities and Consistency in Omni-Language Models
XModBench is a tri-modal benchmark that systematically measures cross-modal consistency, modality disparities, and directional imbalances in omni-language models across five task families and all modality combinations.
-
R1-VL: Learning to Reason with Multimodal Large Language Models via Step-wise Group Relative Policy Optimization
R1-VL uses StepGRPO with rule-based StepRAR and StepRVR rewards to let MLLMs learn step-by-step reasoning beyond imitation of positive paths.
-
WorldSense: Evaluating Real-world Omnimodal Understanding for Multimodal LLMs
WorldSense provides the first benchmark requiring synergistic audio-video-text understanding on 1,662 real-world videos and 3,172 QA pairs, where the best current multimodal LLM reaches only 65.1% accuracy.
-
OmniZip: Audio-Guided Dynamic Token Compression for Fast Omnimodal Large Language Models
OmniZip introduces an audio-guided dynamic token compression framework that achieves 3.42X inference speedup and 1.4X memory reduction for omnimodal LLMs without any training.
-
Qwen2.5-Omni Technical Report
Qwen2.5-Omni presents a multimodal model with block-wise encoders, TMRoPE position embeddings, and a Thinker-Talker architecture that enables simultaneous text and streaming speech generation while matching text performance on reasoning benchmarks.
-
VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding
VideoLLaMA3 uses a vision-centric training paradigm and token-reduction design to reach competitive results on image and video benchmarks.
-
VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction
VITA-1.5 integrates vision and speech into a single LLM through multi-stage training, delivering competitive benchmark results on image, video, and speech tasks with near real-time response speed.