MMWorld: Towards multi-discipline multi-faceted world model evaluation in videos

He, X · 2024 · arXiv 2406.08407

4 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

Video-Holmes: Can MLLM Think Like Holmes for Complex Video Reasoning?

cs.CV · 2025-05-27 · conditional · novelty 7.0

Video-Holmes benchmark shows top MLLMs achieve at most 45% accuracy on tasks needing integration of multiple clues from suspense films, unlike existing perception-focused tests.

WorldSense: Evaluating Real-world Omnimodal Understanding for Multimodal LLMs

cs.CV · 2025-02-06 · unverdicted · novelty 7.0

WorldSense provides the first benchmark requiring synergistic audio-video-text understanding on 1,662 real-world videos and 3,172 QA pairs, where the best current multimodal LLM reaches only 65.1% accuracy.

SoccerRef-Agents: Multi-Agent System for Automated Soccer Refereeing

cs.AI · 2026-04-25 · unverdicted · novelty 6.0

SoccerRef-Agents is a multi-agent framework using MLLMs, cross-modal RAG, and a custom knowledge base that outperforms general MLLMs on soccer foul decisions and explanations.

Position: Multimodal Large Language Models Can Significantly Advance Scientific Reasoning

cs.CL · 2025-02-05 · unverdicted · novelty 2.0

Position paper claims multimodal LLMs can significantly advance scientific reasoning and proposes a four-stage roadmap plus challenges and suggestions.

citing papers explorer

Showing 4 of 4 citing papers.

Video-Holmes: Can MLLM Think Like Holmes for Complex Video Reasoning? · cs.CV · 2025-05-27 · conditional · none · ref 11
WorldSense: Evaluating Real-world Omnimodal Understanding for Multimodal LLMs · cs.CV · 2025-02-06 · unverdicted · none · ref 25
SoccerRef-Agents: Multi-Agent System for Automated Soccer Refereeing · cs.AI · 2026-04-25 · unverdicted · none · ref 13
Position: Multimodal Large Language Models Can Significantly Advance Scientific Reasoning · cs.CL · 2025-02-05 · unverdicted · none · ref 61
