MMWorld: Towards multi-discipline multi-faceted world model evaluation in videos
4 Pith papers cite this work.
citing papers explorer
- Video-Holmes: Can MLLM Think Like Holmes for Complex Video Reasoning?
  The Video-Holmes benchmark shows that top MLLMs achieve at most 45% accuracy on tasks that require integrating multiple clues from suspense films, a skill that existing perception-focused tests do not measure.
- WorldSense: Evaluating Real-world Omnimodal Understanding for Multimodal LLMs
  WorldSense provides the first benchmark requiring synergistic audio-video-text understanding on 1,662 real-world videos and 3,172 QA pairs, where the best current multimodal LLM reaches only 65.1% accuracy.
- SoccerRef-Agents: Multi-Agent System for Automated Soccer Refereeing
  SoccerRef-Agents is a multi-agent framework using MLLMs, cross-modal RAG, and a custom knowledge base that outperforms general MLLMs on soccer foul decisions and explanations.
- Position: Multimodal Large Language Models Can Significantly Advance Scientific Reasoning
  This position paper argues that multimodal LLMs can significantly advance scientific reasoning and proposes a four-stage roadmap, along with challenges and suggestions.