Introduces Video-MME-Logical benchmark for controlled diagnostic evaluation of temporal-logical reasoning in MLLMs via five operations and 25 fine-grained tasks.
Mixed citations
A very big video reasoning suite
Mixed citation behavior. Most common role is background (67%).
citation-role summary
citation-polarity summary
years
2026 11representative citing papers
PhyGround is a new benchmark with curated prompts, a 13-law taxonomy, large-scale human annotations, and an open physics-specialized VLM judge for evaluating physical reasoning in generative video models.
CollabVR improves video reasoning performance by coupling vision-language models and video generation models in a closed-loop step-level collaboration that detects and repairs generation failures.
WorldJen is a new benchmark for generative video models that uses VLM-judged multi-dimensional Likert questionnaires validated against human preferences to achieve perfect tier agreement.
VideoRLVR uses SDE-GRPO optimization, dense decomposed rewards, and Early-Step Focus to train video diffusion models on verifiable reasoning tasks, outperforming supervised fine-tuning and other video generators on Maze, FlowFree, and Sokoban.
The paper presents WorldReasonBench, a benchmark that tests video generators on maintaining physical, social, logical, and informational consistency when predicting future states from initial conditions and actions.
Fine-tuning VLMs to output action sequences for puzzles causes emergent internal visual representations that improve performance when integrated into reasoning.
Current video models succeed on basic understanding but achieve under 25% success on logically grounded generation and near 0% on interactive generation, exposing gaps in multimodal reasoning.
PaintBench provides a scalable deterministic benchmark for precise visual editing operations, revealing that even the best of 11 models achieves only 17.1% mIoU and that scores correlate strongly with applied data visualization editing performance.
SenseNova-U1 presents native unified multimodal models that match top understanding VLMs while delivering strong performance in image generation, infographics, and interleaved tasks via the NEO-unify architecture.
citing papers explorer
-
Video-MME-Logical: A Controlled Diagnostic Benchmark for Video Temporal-Logical Reasoning
Introduces Video-MME-Logical benchmark for controlled diagnostic evaluation of temporal-logical reasoning in MLLMs via five operations and 25 fine-grained tasks.
-
PhyGround: Benchmarking Physical Reasoning in Generative World Models
PhyGround is a new benchmark with curated prompts, a 13-law taxonomy, large-scale human annotations, and an open physics-specialized VLM judge for evaluating physical reasoning in generative video models.
-
CollabVR: Collaborative Video Reasoning with Vision-Language and Video Generation Models
CollabVR improves video reasoning performance by coupling vision-language models and video generation models in a closed-loop step-level collaboration that detects and repairs generation failures.
-
WorldJen: An End-to-End Multi-Dimensional Benchmark for Generative Video Models
WorldJen is a new benchmark for generative video models that uses VLM-judged multi-dimensional Likert questionnaires validated against human preferences to achieve perfect tier agreement.
-
Video Models Can Reason with Verifiable Rewards
VideoRLVR uses SDE-GRPO optimization, dense decomposed rewards, and Early-Step Focus to train video diffusion models on verifiable reasoning tasks, outperforming supervised fine-tuning and other video generators on Maze, FlowFree, and Sokoban.
-
WorldReasonBench: Human-Aligned Stress Testing of Video Generators as Future World-State Predictors
The paper presents WorldReasonBench, a benchmark that tests video generators on maintaining physical, social, logical, and informational consistency when predicting future states from initial conditions and actions.
-
Do multimodal models imagine electric sheep?
Fine-tuning VLMs to output action sequences for puzzles causes emergent internal visual representations that improve performance when integrated into reasoning.
-
How Far Are Video Models from True Multimodal Reasoning?
Current video models succeed on basic understanding but achieve under 25% success on logically grounded generation and near 0% on interactive generation, exposing gaps in multimodal reasoning.
-
PaintBench: Deterministic Evaluation of Precise Visual Editing
PaintBench provides a scalable deterministic benchmark for precise visual editing operations, revealing that even the best of 11 models achieves only 17.1% mIoU and that scores correlate strongly with applied data visualization editing performance.
-
SenseNova-U1: Unifying Multimodal Understanding and Generation with NEO-unify Architecture
SenseNova-U1 presents native unified multimodal models that match top understanding VLMs while delivering strong performance in image generation, infographics, and interleaved tasks via the NEO-unify architecture.
- Entity-Centric World Models: Interaction-Aware Masking for Causal Video Prediction