Recognition: 2 Lean theorem links
PhyMotion: Structured 3D Motion Reward for Physics-Grounded Human Video Generation
Pith reviewed 2026-05-15 02:46 UTC · model grok-4.3
The pith
PhyMotion scores recovered 3D human meshes in a physics simulator to reward realistic motion in generated videos.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
PhyMotion recovers SMPL meshes from generated videos, retargets them onto a humanoid in MuJoCo, and evaluates the resulting motion along kinematic plausibility, contact and balance consistency, and dynamic feasibility. Each axis supplies a continuous, interpretable score tied to a distinct physical property. The combined reward correlates more strongly with human judgments of realism than existing formulations and, when used for RL post-training, produces larger and more consistent motion improvements across generator architectures while preserving overall video quality.
What carries the argument
PhyMotion reward: structured scorer that retargets SMPL meshes from video into MuJoCo to compute separate kinematic, contact-balance, and dynamic feasibility scores.
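The three-axis structure can be sketched as follows. This is a minimal illustration, not the authors' code: the axis names, the [0, 1] score ranges, and the uniform 1/3 weighting are assumptions read off the abstract's R_motion(v) = (1/3)(F_kin + F_con + F_dyn) formulation.

```python
# Hypothetical sketch of PhyMotion-style reward aggregation: three per-axis
# scores (names and ranges assumed) are averaged into one scalar reward.
from dataclasses import dataclass

@dataclass
class AxisScores:
    kinematic: float   # F_kin: joint-limit / self-penetration plausibility, in [0, 1]
    contact: float     # F_con: contact and balance consistency, in [0, 1]
    dynamic: float     # F_dyn: force/torque feasibility in the simulator, in [0, 1]

def phymotion_reward(s: AxisScores) -> float:
    """Uniform average of the three axis scores (equal weights are an assumption)."""
    return (s.kinematic + s.contact + s.dynamic) / 3.0

# A floating body should fail contact/balance while the other axes stay high,
# so the combined reward drops but the failure remains diagnosable per axis.
floating = AxisScores(kinematic=0.9, contact=0.1, dynamic=0.8)
print(round(phymotion_reward(floating), 3))  # 0.6
```

Keeping the three scores separate until the final average is what makes the reward interpretable: a low combined score can be traced back to the axis that produced it.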
If this is right
- Optimizing PhyMotion in RL post-training yields larger motion-realism gains than optimizing existing 2D perceptual rewards.
- The three scoring axes supply complementary signals that allow diagnosis of specific motion flaws.
- Improvements hold for both autoregressive and bidirectional video generators.
- Overall video quality remains intact with only modest extra training cost.
- Blind human raters show a +68 Elo gain in motion preference for PhyMotion-trained outputs.
Where Pith is reading between the lines
- Direct integration of similar physics-based rewards into the base training loop could reduce the need for separate post-training stages.
- Better 3D mesh recovery methods would raise the ceiling of what this reward can achieve.
- The same retarget-and-simulate approach could be tested on non-human articulated motion such as animals or robots.
- Explicit 3D physics modeling may become necessary for any generative model that aims at high-fidelity human movement.
Load-bearing premise
SMPL mesh recovery from generated videos is accurate enough and retargeting those meshes into MuJoCo faithfully reflects the physical violations that human viewers notice.
What would settle it
Generate videos containing deliberate physical errors such as floating bodies or impossible joint angles, then check whether PhyMotion consistently assigns them lower scores than matched realistic videos and whether optimizing it reduces those errors.
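A minimal sketch of that settling experiment, with `score` standing in for the full recover-retarget-simulate pipeline and all numbers purely illustrative:

```python
# Hypothetical protocol for the proposed falsification test: pair each
# deliberately corrupted video (floating body, impossible joint angle) with a
# matched realistic one, then count how often the reward ranks the realistic
# clip higher. `score` is a stand-in for the full PhyMotion pipeline.
def ranking_accuracy(pairs, score):
    """Fraction of (realistic, corrupted) pairs where the realistic clip scores higher."""
    wins = sum(1 for real, corrupt in pairs if score(real) > score(corrupt))
    return wins / len(pairs)

# Toy stand-in: each "video" is a dict carrying a precomputed reward.
score = lambda v: v["reward"]
pairs = [
    ({"reward": 0.8}, {"reward": 0.3}),  # floating body penalized
    ({"reward": 0.7}, {"reward": 0.2}),  # impossible joint angle penalized
    ({"reward": 0.6}, {"reward": 0.9}),  # a failure case slips through
]
print(ranking_accuracy(pairs, score))  # 2/3 on this toy set
```

An accuracy near 1.0 on such matched pairs would support the load-bearing premise; an accuracy near chance would indicate the reward is tracking something other than the injected physical errors.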
Original abstract
Generating realistic human motion is a central yet unsolved challenge in video generation. While reinforcement learning (RL)-based post-training has driven recent gains in general video quality, extending it to human motion remains bottlenecked by a reward signal that cannot reliably score motion realism. Existing video rewards primarily rely on 2D perceptual signals, without explicitly modeling the 3D body state, contact, and dynamics underlying articulated human motion, and often assign high scores to videos with floating bodies or physically implausible movements. To address this, we propose PhyMotion, a structured, fine-grained motion reward that grounds recovered 3D human trajectories in a physics simulator and evaluates motion quality along multiple dimensions of physical feasibility. Concretely, we recover SMPL body meshes from generated videos, retarget them onto a humanoid in the MuJoCo physics simulator, and evaluate the resulting motion along three axes: kinematic plausibility, contact and balance consistency, and dynamic feasibility. Each component provides a continuous and interpretable signal tied to a specific aspect of motion quality, allowing the reward to capture which aspects of motion are physically correct or violated. Experiments show that PhyMotion achieves stronger correlation with human judgments than existing reward formulations. These gains carry over to RL-based post-training, where optimizing PhyMotion leads to larger and more consistent improvements than optimizing existing rewards, improving motion realism across both autoregressive and bidirectional video generators under both automatic metrics and blind human evaluation (+68 Elo gain). Ablations show that the three axes provide complementary supervision signals, while the reward preserves overall video generation quality with only modest training overhead.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes PhyMotion, a structured 3D motion reward for physics-grounded human video generation. It recovers SMPL body meshes from generated videos, retargets them onto a humanoid in MuJoCo, and evaluates the motion along three axes: kinematic plausibility, contact and balance consistency, and dynamic feasibility. Experiments claim stronger correlation with human judgments than existing 2D rewards, plus larger and more consistent RL post-training gains (+68 Elo) across autoregressive and bidirectional generators, with ablations showing complementary signals from the three axes.
Significance. If the SMPL recovery and retargeting steps are shown to be accurate, the work would provide a valuable advance by supplying interpretable, multi-axis physics signals that outperform 2D perceptual rewards for RL in human video generation. The reported human correlation gains, Elo improvements, and ablation results on complementary axes would be a substantive contribution to reward design for motion realism.
major comments (2)
- [Method section on 3D recovery and retargeting] The central claim that PhyMotion measures physical violations (rather than 3D consistency proxies) rests on SMPL recovery from generated videos and MuJoCo retargeting, yet no quantitative validation—such as pose/joint error rates on recovered vs. ground-truth trajectories or ablation on retargeting noise—is reported. This is load-bearing for the +68 Elo and human-preference results.
- [Experiments section (RL and human eval)] Experiments on RL gains and human evaluation report the +68 Elo improvement and stronger correlation, but lack controls or analysis showing that gains are driven by the physics axes rather than artifacts from SMPL estimation errors on artifact-laden generated frames.
minor comments (2)
- [Method] Clarify the exact definitions and weighting of the three reward axes in the method description for reproducibility.
- [Experiments] Add error bars or statistical significance tests to the Elo and correlation tables to support the cross-generator claims.
Simulated Author's Rebuttal
We thank the referee for the constructive and insightful comments. The concerns about validation of the SMPL recovery and retargeting pipeline, as well as controls for potential estimation artifacts in the RL and human evaluation results, are well-taken and directly relevant to the strength of our claims. We address each point below and will revise the manuscript accordingly to provide the requested evidence.
Point-by-point responses
- Referee: [Method section on 3D recovery and retargeting] The central claim that PhyMotion measures physical violations (rather than 3D consistency proxies) rests on SMPL recovery from generated videos and MuJoCo retargeting, yet no quantitative validation—such as pose/joint error rates on recovered vs. ground-truth trajectories or ablation on retargeting noise—is reported. This is load-bearing for the +68 Elo and human-preference results.
Authors: We agree that quantitative validation of the SMPL recovery accuracy and retargeting robustness is essential to confirm that PhyMotion evaluates genuine physical violations. The current manuscript relies on the downstream human correlation and RL gains as indirect evidence, but does not report direct error metrics. In the revised manuscript we will add: (1) pose and joint position error rates computed on a held-out set of videos with available ground-truth SMPL parameters from motion-capture datasets, and (2) an ablation that injects controlled Gaussian noise into the retargeted joint trajectories and measures the resulting change in reward scores and RL performance. These additions will directly substantiate the load-bearing steps for the reported +68 Elo and preference improvements. revision: yes
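The promised noise ablation could look roughly like this. This is our assumption about the protocol, not the authors' code, and the smoothness score is a toy stand-in for the real kinematic axis:

```python
# Hypothetical retargeting-noise ablation: inject Gaussian noise of increasing
# scale into retargeted joint trajectories and record how a reward-axis proxy
# degrades. A reward that collapses under tiny noise would be dominated by
# estimation error rather than genuine physical violations.
import random

def perturb_trajectory(traj, sigma, rng):
    """Add i.i.d. Gaussian noise to every joint coordinate in every frame."""
    return [[x + rng.gauss(0.0, sigma) for x in frame] for frame in traj]

def smoothness_score(traj):
    """Toy stand-in for a kinematic axis: penalize frame-to-frame jitter."""
    jitter = sum(abs(b - a) for f0, f1 in zip(traj, traj[1:]) for a, b in zip(f0, f1))
    return 1.0 / (1.0 + jitter)

rng = random.Random(0)
clean = [[0.0, 0.0], [0.1, 0.0], [0.2, 0.0]]  # two joint coords, three frames
for sigma in (0.0, 0.05, 0.5):
    noisy = perturb_trajectory(clean, sigma, rng)
    print(sigma, round(smoothness_score(noisy), 3))
```

Plotting score against sigma would show how gracefully the reward degrades, which is exactly the robustness evidence the referee asks for.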
- Referee: [Experiments section (RL and human eval)] Experiments on RL gains and human evaluation report the +68 Elo improvement and stronger correlation, but lack controls or analysis showing that gains are driven by the physics axes rather than artifacts from SMPL estimation errors on artifact-laden generated frames.
Authors: We acknowledge the need for explicit controls to rule out confounding from SMPL estimation errors on low-quality generated frames. The manuscript currently demonstrates complementary signals via axis ablations but does not isolate estimation noise. In the revision we will include: (1) a control experiment that applies the same RL pipeline using perturbed SMPL estimates (Gaussian noise added to recovered joints) and compares the resulting Elo gains against the original PhyMotion reward, and (2) a per-frame analysis correlating reward scores with known artifact types (e.g., floating bodies) versus clean frames. These controls will show that the observed improvements arise from the kinematic, contact, and dynamic physics signals rather than estimation artifacts. revision: yes
Circularity Check
No significant circularity in PhyMotion reward derivation
Full rationale
The PhyMotion reward is defined directly from three explicit physics axes (kinematic plausibility, contact/balance, dynamic feasibility) computed in an external MuJoCo simulator after SMPL recovery and retargeting. No equation, parameter fit, or component reduces the final reward value to a fitted quantity derived from the same generated videos or human preference data used for evaluation. The reported gains (+68 Elo, stronger human correlation) are presented as outcomes of optimizing this independent simulator-based signal rather than any self-referential construction or self-citation chain. Self-citations, if present, are not load-bearing for the core reward definition.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: The SMPL body model accurately recovers 3D meshes from 2D video frames.
- domain assumption: The MuJoCo humanoid simulation faithfully reproduces the contact and dynamic constraints relevant to human perception.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · unclear
unclear: Relation between the paper passage and the cited Recognition theorem.
Passage: "recover SMPL body meshes from generated videos, retarget them onto a humanoid in the MuJoCo physics simulator, and evaluate the resulting motion along three axes: kinematic plausibility, contact and balance consistency, and dynamic feasibility"
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
unclear: Relation between the paper passage and the cited Recognition theorem.
Passage: "R_motion(v) = (1/3)(F_kin(v) + F_con(v) + F_dyn(v))"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] Zhaochong An, Orest Kupyn, Théo Uscidda, Andrea Colaco, Karan Ahuja, Serge Belongie, Mar Gonzalez-Franco, and Marta Tintore Gazulla. Vggrpo: Towards world-consistent video generation with 4D latent reward. arXiv preprint arXiv:2603.26599.
- [2] Hila Chefer, Uriel Singer, Amit Zohar, Yuval Kirstain, Adam Polyak, Yaniv Taigman, Lior Wolf, and Shelly Sheynin. VideoJAM: Joint appearance-motion representations for enhanced motion generation in video models. arXiv preprint arXiv:2502.02492.
- [3] Yu Gao, Haoyuan Guo, Tuyen Hoang, Weilin Huang, Lu Jiang, Fangyuan Kong, Huixia Li, Jiashi Li, Liang Li, Xiaojie Li, et al. Seedance 1.0: Exploring the boundaries of video generation models. arXiv preprint arXiv:2506.09113.
- [4] Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948.
- [5] Haoran He, Yuxiao Ye, Jie Liu, Jiajun Liang, Zhiyong Wang, Ziyang Yuan, Xintao Wang, Hangyu Mao, Pengfei Wan, and Ling Pan. GARDO: Reinforcing diffusion models without reward hacking. arXiv preprint arXiv:2512.24138.
- [6] Xuan He, Dongfu Jiang, Ge Zhang, Max Ku, Achint Soni, Sherman Siu, Haonan Chen, Abhranil Chandra, Ziyan Jiang, Aaran Arulraj, Kai Wang, Quy Duc Do, Yuansheng Ni, Bohan Lyu, Yaswanth Narsupalli, Rongqi Fan, Zhiheng Lyu, Yuchen Lin, and Wenhu Chen. VideoScore: Building automatic metrics to simulate fine-grained human feedback for video generation. arXiv preprint arXiv:2406.15252.
- [7] Yidong Huang, Zun Wang, Han Lin, Dong-Ki Kim, Shayegan Omidshafiei, Jaehong Yoon, Yue Zhang, and Mohit Bansal. Planning with sketch-guided verification for physics-aware video generation. arXiv preprint arXiv:2511.17450.
- [8] Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, Yaohui Wang, Xinyuan Chen, Limin Wang, Dahua Lin, Yu Qiao, and Ziwei Liu. VBench: Comprehensive benchmark suite for video generative models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024a.
- [9] Jie Liu, Gongye Liu, Jiajun Liang, Yangguang Li, Jiaheng Liu, Xintao Wang, Pengfei Wan, Di Zhang, and Wanli Ouyang. Flow-GRPO: Training flow matching models via online RL. arXiv preprint arXiv:2505.05470, 2025a.
- [10] Yunhong Lu, Yanhong Zeng, Haobo Li, Hao Ouyang, Qiuyu Wang, Ka Leong Cheng, Jiapeng Zhu, Hengyuan Cao, Zhipeng Zhang, Xing Zhu, et al. Reward forcing: Efficient streaming video generation with rewarded distribution matching distillation. arXiv preprint arXiv:2512.04678.
- [11] Yuhang Ma, Yunhao Shui, Xiaoshi Wu, Keqiang Sun, and Hongsheng Li. HPSv3: Towards wide-spectrum human preference score. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV).
- [12] Fanqing Meng, Jiaqi Liao, Xinyu Tan, Wenqi Shao, Quanfeng Lu, Kaipeng Zhang, Yu Cheng, Dianqi Li, Yu Qiao, and Ping Luo. Towards world simulator: Crafting physical commonsense-based benchmark for video generation. arXiv preprint arXiv:2410.05363.
- [13] Ruizhi Shao, Youxin Pang, Zerong Zheng, Jingxiang Sun, and Yebin Liu. Human4DiT: Free-view human video generation with 4D diffusion transformer. arXiv preprint arXiv:2405.17405, 2024a.
- [14] Kaiyue Sun, Kaiyi Huang, Xian Liu, Yue Wu, Zihan Xu, Zhenguo Li, and Xihui Liu. T2V-CompBench: A comprehensive benchmark for compositional text-to-video generation. arXiv preprint arXiv:2407.14505.
- [15] Team Wan, Ang Wang, et al. Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314.
- [16] Thomas Unterthiner, Sjoerd van Steenkiste, Karol Kurach, Raphaël Marinier, Marcin Michalski, and Sylvain Gelly. FVD: A new metric for video generation. In ICLR Workshop on Deep Generative Models for Highly Structured Data.
- [17] Haoru Wang, Wentao Zhu, Luyi Miao, Yishu Xu, Feng Gao, Qi Tian, and Yizhou Wang. Aligning human motion generation with human perceptions. arXiv preprint arXiv:2407.02272, 2024a.
- [18] Zun Wang, Han Lin, Jaehong Yoon, Jaemin Cho, Yue Zhang, and Mohit Bansal. Anchorweave: World-consistent video generation with retrieved local spatial memories. arXiv preprint arXiv:2602.14941, 2026b. David A. Winter. Biomechanics and Motor Control of Human Movement. John Wiley & Sons.
- [19] Jiazheng Xu, Yu Huang, Jiale Cheng, Yuanming Yang, Jiajun Xu, Yuan Wang, Wenbo Duan, Shen Yang, Qunlin Jin, Shurun Li, et al. VisionReward: Fine-grained multi-dimensional human preference learning for image and video generation. arXiv preprint arXiv:2412.21059, 2024a.
- [20] Peiyuan Zhang, Yongqi Chen, Runlong Su, Hangliang Ding, Ion Stoica, Zhengzhong Liu, and Hao Zhang. Fast video generation with sliding tile attention. arXiv preprint arXiv:2502.04507, 2025.
- [21] Songchun Zhang, Zeyue Xue, Siming Fu, Jie Huang, Xianghao Kong, Yue Ma, Haoyang Huang, Nan Duan, and Anyi Rao. Astrolabe: Steering forward-process reinforcement learning for distilled autoregressive video models. arXiv preprint arXiv:2603.17051.
- [22] Dian Zheng, Ziqi Huang, Hongbo Liu, Kai Zou, Yinan He, Fan Zhang, Lulu Gu, Yuanhan Zhang, Jingwen He, Wei-Shi Zheng, Yu Qiao, and Ziwei Liu. VBench-2.0: Advancing video generation benchmark suite for intrinsic faithfulness. arXiv preprint arXiv:2503.21755.
- [23] Jingkai Zhou, Benzhi Wang, Weihua Chen, Jingqi Bai, Dongyang Li, Aixi Zhang, Hao Xu, Mingyang Yang, and Fan Wang. RealisDance: Equip controllable character animation with realistic hands. arXiv preprint arXiv:2409.06202.
- [24] Hongzhou Zhu, Min Zhao, Guande He, Hang Su, Chongxuan Li, and Jun Zhu. Causal forcing: Autoregressive diffusion distillation done right for high-quality real-time interactive video generation. arXiv preprint arXiv:2602.02214.