A very big video reasoning suite

Maijunxian Wang, Ruisi Wang, Juyi Lin, Ran Ji, Thaddäus Wiedemer, Qingying Gao, Dezhi Luo, Yaoyao Qian, Lianyu Huang, Zelong Hong, Jiahui Ge, Qianli Ma, Hang He, Yifan Zhou, Lingzi Guo, Lantao Mei, Jiachen Li, Hanwen Xing, Tianqi Zhao, Feng · 2026 · arXiv 2602.20159

7 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

dataset 1

citation-polarity summary

use dataset 1

representative citing papers

PhyGround: Benchmarking Physical Reasoning in Generative World Models

cs.CV · 2026-05-11 · accept · novelty 7.0

PhyGround is a new benchmark with curated prompts, a 13-law taxonomy, large-scale human annotations, and an open physics-specialized VLM judge for evaluating physical reasoning in generative video models.

CollabVR: Collaborative Video Reasoning with Vision-Language and Video Generation Models

cs.CV · 2026-05-09 · unverdicted · novelty 7.0

CollabVR improves video reasoning performance by coupling vision-language models and video generation models in a closed-loop step-level collaboration that detects and repairs generation failures.

WorldJen: An End-to-End Multi-Dimensional Benchmark for Generative Video Models

cs.CV · 2026-05-05 · unverdicted · novelty 7.0 · 2 refs

WorldJen is a new benchmark for generative video models that uses VLM-judged multi-dimensional Likert questionnaires validated against human preferences to achieve perfect tier agreement.

WorldReasonBench: Human-Aligned Stress Testing of Video Generators as Future World-State Predictors

cs.CV · 2026-05-11 · unverdicted · novelty 6.0

The paper presents WorldReasonBench, a benchmark that tests video generators on maintaining physical, social, logical, and informational consistency when predicting future states from initial conditions and actions.

Do multimodal models imagine electric sheep?

cs.CV · 2026-05-10 · conditional · novelty 6.0

Fine-tuning VLMs to output action sequences for puzzles causes emergent internal visual representations that improve performance when integrated into reasoning.

How Far Are Video Models from True Multimodal Reasoning?

cs.CV · 2026-04-21 · unverdicted · novelty 6.0

Current video models succeed on basic understanding but achieve under 25% success on logically grounded generation and near 0% on interactive generation, exposing gaps in multimodal reasoning.

SenseNova-U1: Unifying Multimodal Understanding and Generation with NEO-unify Architecture

cs.CV · 2026-05-12 · unverdicted · novelty 5.0

SenseNova-U1 presents native unified multimodal models that match top understanding VLMs while delivering strong performance in image generation, infographics, and interleaved tasks via the NEO-unify architecture.

citing papers explorer

PhyGround: Benchmarking Physical Reasoning in Generative World Models cs.CV · 2026-05-11 · accept · none · ref 44
CollabVR: Collaborative Video Reasoning with Vision-Language and Video Generation Models cs.CV · 2026-05-09 · unverdicted · none · ref 33
WorldJen: An End-to-End Multi-Dimensional Benchmark for Generative Video Models cs.CV · 2026-05-05 · unverdicted · none · ref 29 · 2 links
WorldReasonBench: Human-Aligned Stress Testing of Video Generators as Future World-State Predictors cs.CV · 2026-05-11 · unverdicted · none · ref 24
Do multimodal models imagine electric sheep? cs.CV · 2026-05-10 · conditional · none · ref 46
How Far Are Video Models from True Multimodal Reasoning? cs.CV · 2026-04-21 · unverdicted · none · ref 68
SenseNova-U1: Unifying Multimodal Understanding and Generation with NEO-unify Architecture cs.CV · 2026-05-12 · unverdicted · none · ref 133
