citation dossier
VBench-2.0: Advancing Video Generation Benchmark Suite for Intrinsic Faithfulness
why this work matters in Pith
Pith has found this work cited in 20 reviewed papers. Its strongest current cluster is cs.CV (18 papers), and the largest review-status bucket among citing papers is UNVERDICTED (18 papers). For highly cited works, this page shows a dossier first and a bounded explorer second; it never tries to render every citing paper at once.
years: 2026 (20 representative citing papers)
citing papers explorer
- PhysInOne: Visual Physics Learning and Reasoning in One Suite
  PhysInOne is a new dataset of 2 million videos across 153,810 dynamic 3D scenes covering 71 physical phenomena, shown to improve AI performance on physics-aware video generation, prediction, property estimation, and motion transfer.
- PhyGround: Benchmarking Physical Reasoning in Generative World Models
  PhyGround is a new benchmark with curated prompts, a 13-law taxonomy, large-scale human annotations, and an open physics-specialized VLM judge for evaluating physical reasoning in generative video models.
- WorldJen: An End-to-End Multi-Dimensional Benchmark for Generative Video Models
  WorldJen is a new benchmark for generative video models that uses VLM-judged multi-dimensional Likert questionnaires validated against human preferences to achieve perfect tier agreement.
- Hallo-Live: Real-Time Streaming Joint Audio-Video Avatar Generation with Asynchronous Dual-Stream and Human-Centric Preference Distillation
  Hallo-Live achieves 20.38 FPS real-time text-to-audio-video avatar generation with 0.94s latency using asynchronous dual-stream diffusion and HP-DMD preference distillation, matching teacher model quality at 16x higher throughput.
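  A quick sanity check on those figures, assuming the 16x throughput claim and the FPS claim measure the same quantity: the teacher would generate at roughly 20.38 / 16 ≈ 1.3 FPS, well below real time, which is the gap the distilled student closes.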
- RoboWM-Bench: A Benchmark for Evaluating World Models in Robotic Manipulation
  RoboWM-Bench evaluates video world models by converting their outputs into executable robot actions and running them on manipulation tasks, showing that physical inconsistencies remain common.
- AnimationBench: Are Video Models Good at Character-Centric Animation?
  AnimationBench is the first benchmark that operationalizes the twelve basic principles of animation and IP preservation into scalable, VLM-assisted metrics for animation-style I2V generation.
- Grounded Forcing: Bridging Time-Independent Semantics and Proximal Dynamics in Autoregressive Video Synthesis
  Grounded Forcing introduces dual memory caching, reference-based positional embeddings, and proximity-weighted recaching to bridge stable semantics with local dynamics, improving long-range consistency in autoregressive video synthesis.
- WorldReasonBench: Human-Aligned Stress Testing of Video Generators as Future World-State Predictors
  WorldReasonBench tests video generators on maintaining physical, social, logical, and informational consistency when predicting future states from initial conditions and actions.
- SARA: Semantically Adaptive Relational Alignment for Video Diffusion Models
  SARA improves text alignment and motion quality in video diffusion models by routing token-relation distillation supervision to semantically salient pairs, using a Stage-1 aligner trained with SAM masks and InfoNCE.
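  The SARA entry names InfoNCE as the Stage-1 training objective. As a hedged illustration only (this is not SARA's code; the pairing of text-token embeddings with SAM-masked region embeddings and the temperature value are assumptions), a minimal InfoNCE loss in PyTorch:

      import torch
      import torch.nn.functional as F

      def info_nce(anchors: torch.Tensor, positives: torch.Tensor,
                   temperature: float = 0.07) -> torch.Tensor:
          # anchors, positives: (B, D); row i of positives is the match
          # for row i of anchors, and all other rows serve as negatives.
          a = F.normalize(anchors, dim=-1)
          p = F.normalize(positives, dim=-1)
          logits = a @ p.t() / temperature          # (B, B) similarities
          targets = torch.arange(a.size(0), device=a.device)
          return F.cross_entropy(logits, targets)   # positives on the diagonal

      # e.g. loss = info_nce(text_token_emb, masked_region_emb)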
- HuM-Eval: A Coarse-to-Fine Framework for Human-Centric Video Evaluation
  HuM-Eval evaluates human motion videos with a coarse-to-fine approach using VLM global checks plus 2D pose and 3D motion analysis, reaching 58.2% average correlation with human judgments and introducing a 1000-prompt benchmark.
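  For readers unfamiliar with "correlation with human judgments": benchmarks in this family typically score the same videos with the metric and with mean human ratings, then report a rank correlation. A generic sketch (the paper's 58.2% figure may use a different statistic or aggregation; the numbers below are made up):

      from scipy.stats import spearmanr

      # hypothetical per-video metric scores vs. mean human ratings
      metric_scores = [0.81, 0.42, 0.67, 0.55, 0.90]
      human_ratings = [4.2, 2.1, 3.8, 2.9, 4.6]

      rho, p_value = spearmanr(metric_scores, human_ratings)
      print(f"Spearman rho = {rho:.3f} (p = {p_value:.3f})")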
- Seeing Fast and Slow: Learning the Flow of Time in Videos
  Self-supervised models learn to perceive and manipulate the flow of time in videos, supporting speed detection, large-scale slow-motion data curation, and temporally controllable video synthesis.
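  The summary describes self-supervised speed perception. One standard pretext task in this family is playback-speed prediction (SpeedNet-style); whether the paper uses exactly this formulation is an assumption. A minimal sketch:

      import random
      import torch

      SPEEDS = [1, 2, 4, 8]  # candidate playback strides (assumed values)

      def make_speed_sample(video: torch.Tensor, clip_len: int = 16):
          # video: (T, C, H, W) with T >= max(SPEEDS) * clip_len.
          # Returns a clip subsampled at a random stride plus the stride's
          # class index, which a video backbone is trained to predict.
          label = random.randrange(len(SPEEDS))
          stride = SPEEDS[label]
          start = random.randrange(video.shape[0] - stride * clip_len + 1)
          idx = torch.arange(start, start + stride * clip_len, stride)
          return video[idx], label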
- Long-CODE: Isolating Pure Long-Context as an Orthogonal Dimension in Video Evaluation
  Long-CODE isolates long-context video evaluation with a new benchmark dataset and a shot-dynamics metric that correlates better with human judgments on narrative richness and global consistency than short-video metrics.
- VGA-Bench: A Unified Benchmark and Multi-Model Framework for Video Aesthetics and Generation Quality Evaluation
  VGA-Bench creates a three-tier taxonomy, a 1,016-prompt dataset of 60k+ videos, and three multi-task neural models (VAQA-Net, VTag-Net, VGQA-Net) that align with human judgments for video aesthetics and generation quality.
- ImVideoEdit: Image-learning Video Editing via 2D Spatial Difference Attention Blocks
  ImVideoEdit learns video editing from 13K image pairs by decoupling spatial modifications from frozen temporal dynamics in pretrained models, matching larger video-trained systems in fidelity and consistency.
- Diffusion-APO: Trajectory-Aware Direct Preference Alignment for Video Diffusion Transformers
  Diffusion-APO synchronizes training noise with inference trajectories in video diffusion models to improve preference alignment and visual quality.
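  Diffusion-APO's exact objective is not given here. For context, the "direct preference alignment" in the title refers to the DPO family; a generic DPO-style preference loss for diffusion models (the trajectory-aware noise synchronization is the paper's addition and is not shown; function name and beta value are assumptions) looks like:

      import torch.nn.functional as F

      def diffusion_dpo_loss(err_w_policy, err_w_ref,
                             err_l_policy, err_l_ref, beta=500.0):
          # Each err_* is a per-sample denoising MSE at a shared noise level,
          # for the preferred (w) and rejected (l) video under the trained
          # policy and a frozen reference model. The loss rewards the policy
          # for improving the preferred sample more than the rejected one.
          margin = (err_w_ref - err_w_policy) - (err_l_ref - err_l_policy)
          return -F.logsigmoid(beta * margin).mean()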
- LoViF 2026: The First Challenge on Holistic Quality Assessment for 4D World Model (PhyScore)
  The PhyScore challenge creates the first benchmark requiring metrics to jointly score video quality, physical realism, condition alignment, and temporal consistency, while localizing physical anomalies in 1,554 videos from seven generative models across text-to-2D, image-to-4D, and video-to-4D tracks.
- Phantom: Physics-Infused Video Generation via Joint Modeling of Visual and Latent Physical Dynamics
  Phantom generates visually realistic and physically consistent videos by jointly modeling visual content and latent physical dynamics via an abstract physics-aware representation.
- World Action Models: The Next Frontier in Embodied AI
  The paper introduces World Action Models as a new paradigm that unifies predictive world modeling with action generation in embodied foundation models, and provides a taxonomy of existing approaches.
- Mamoda2.5: Enhancing Unified Multimodal Model with DiT-MoE
  Mamoda2.5 is a 25B-parameter unified AR-Diffusion model built on a DiT-MoE backbone that achieves top results on video generation and editing benchmarks with 4-step inference, up to 95.9x faster than baselines.
- Evolution of Video Generative Foundations
  This survey traces video generation technology from GANs to diffusion models and then to autoregressive and multimodal approaches, analyzing principles, strengths, and future trends.