WorldSense provides the first benchmark requiring synergistic audio-video-text understanding on 1,662 real-world videos and 3,172 QA pairs, where the best current multimodal LLM reaches only 65.1% accuracy.
Ee-mllm: A data-efficient and compute-efficient multimodal large language model
2 Pith papers cite this work. Polarity classification is still indexing.
2
Pith papers citing it
fields
cs.CV 2verdicts
UNVERDICTED 2representative citing papers
Visual tokens enter VLMs as raw signals and are reshaped differently by in-context versus layer-injection paradigms, each capturing distinct frequency characteristics that drive task performance.
citing papers explorer
-
WorldSense: Evaluating Real-world Omnimodal Understanding for Multimodal LLMs
WorldSense provides the first benchmark requiring synergistic audio-video-text understanding on 1,662 real-world videos and 3,172 QA pairs, where the best current multimodal LLM reaches only 65.1% accuracy.
-
The Hidden Evolution of Disguised Visual Context inside the VLM
Visual tokens enter VLMs as raw signals and are reshaped differently by in-context versus layer-injection paradigms, each capturing distinct frequency characteristics that drive task performance.