Aero-World: Action-Conditioned Aerial Video Generation from Inertial Controls

arXiv: 2605.19728 · v1 · pith:F2USP5AD · submitted 2026-05-19 · 💻 cs.CV

Abdul Mohaimen Al Radi, Kunyang Li, Yuzhang Shang, Mubarak Shah, Yu Tian

Pith reviewed 2026-05-20 05:38 UTC · model grok-4.3

classification 💻 cs.CV

keywords: aerial video generation, action-conditioned diffusion, inertial controls, physics probe, drone simulation, video diffusion models, LoRA finetuning, AeroBench

The pith

A frozen Physics Probe supplies inertial consistency checks that let a pretrained video diffusion model generate aerial footage aligned with low-level acceleration and rotation commands.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to turn language-trained video generators into tools for embodied aerial AI by conditioning them on fine-grained inertial signals instead of text. It does this by streaming action tokens into a latent diffusion transformer and supervising LoRA updates with a frozen Physics Probe that was trained once on real video-IMU pairs. The probe supplies differentiable motion-consistency loss without ever decoding full videos. A new benchmark, AeroBench, quantifies success through Action Alignment Score and Physical Consistency Rate. If the approach holds, it supplies a cheap, scalable source of action-faithful drone videos that can stand in for costly real flights or simulators when training aerial agents.

Core claim

Aero-World converts a pretrained image-to-video diffusion model into a controllable aerial video generator by injecting sequences of translational acceleration and angular velocity through an action-token stream. A frozen latent-space Physics Probe, trained independently on real video-IMU pairs, supplies differentiable inertial-consistency supervision during LoRA finetuning. On the introduced AeroBench, the method raises mean Action Alignment Score from 57.7 to 63.6 over action-only finetuning. Relative to the prior AirScape baseline, it achieves lower FVD (596.5 vs. 1058.6), higher SSIM (0.595 vs. 0.505), and higher Flow-IMU correlation (0.44 vs. 0.20).

What carries the argument

The frozen latent-space Physics Probe, trained on real video-IMU pairs, which delivers differentiable inertial-consistency supervision without requiring video decoding during finetuning.
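
A minimal sketch of how probe supervision of this kind could enter a training objective. The bin count K=7 and the six axes are taken from the Figure 3 caption. The equal-width binning over [-1, 1], the cross-entropy form, the weight `lam`, and every function name are illustrative assumptions, not the authors' implementation.

```python
import math

K = 7     # bins per axis (from the Figure 3 caption)
AXES = 6  # 3 translational-acceleration + 3 angular-velocity axes

def to_bin(value, lo=-1.0, hi=1.0, k=K):
    """Map a normalized control value to one of k equal-width bins (assumed scheme)."""
    clipped = min(max(value, lo), hi)
    idx = int((clipped - lo) / (hi - lo) * k)
    return min(idx, k - 1)  # value == hi falls into the last bin

def cross_entropy(logits, target):
    """Numerically stable softmax cross-entropy for a single axis."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]

def probe_loss(axis_logits, commanded):
    """Mean cross-entropy between frozen-probe logits and commanded-action bins."""
    assert len(axis_logits) == len(commanded) == AXES
    losses = [cross_entropy(l, to_bin(a)) for l, a in zip(axis_logits, commanded)]
    return sum(losses) / AXES

def total_loss(diffusion_loss, axis_logits, commanded, lam=0.1):
    """Hypothetical combined objective: denoising loss + lam * probe loss."""
    return diffusion_loss + lam * probe_loss(axis_logits, commanded)
```

In a real training loop the probe's weights would stay fixed while gradients flow through its inputs (the generated latents) back into the LoRA parameters; this pure-Python version only illustrates the arithmetic of the loss.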

If this is right

Generated videos show higher agreement with commanded inertial actions as measured by the Action Alignment Score.
The method improves the quality-consistency trade-off relative to prior action-conditioned baselines.
AeroBench metrics can be used to compare any future action-conditioned aerial video generator.
The generated videos can serve as scalable proxy data for training or evaluating aerial agents.
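
This page does not reproduce AeroBench's formula for the Action Alignment Score. As one plausible reading, assuming AAS is a percentage of per-axis agreement between the action bins recovered from a generated clip and the commanded bins, it could be computed as below. The function name and the bin-agreement definition are hypothetical and may differ from the paper's metric.

```python
def action_alignment_score(predicted_bins, commanded_bins):
    """Percentage of (clip, axis) entries whose predicted bin equals the commanded bin.

    predicted_bins, commanded_bins: lists of per-clip lists of integer bin indices.
    This agreement-rate definition is an assumption, not AeroBench's formula.
    """
    if len(predicted_bins) != len(commanded_bins):
        raise ValueError("number of clips must match")
    total = 0
    hits = 0
    for pred, cmd in zip(predicted_bins, commanded_bins):
        if len(pred) != len(cmd):
            raise ValueError("number of axes must match within each clip")
        for p, c in zip(pred, cmd):
            total += 1
            hits += int(p == c)
    if total == 0:
        raise ValueError("no entries to score")
    return 100.0 * hits / total
```

Under this reading, a score is comparable across generators only if the same evaluator and the same binning are used for every model.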

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same probe-supervision pattern could be tested on ground-vehicle or manipulator video generation where low-level controls must be respected.
Running the generated videos through a downstream navigation planner would test whether higher AAS actually improves agent success rates.
Expanding the Physics Probe training set to more drone models and weather conditions could increase robustness of the supervision signal.

Load-bearing premise

A latent Physics Probe trained once on real video-IMU pairs can give reliable motion-consistency signals when kept frozen during later LoRA adaptation of a video generator.

What would settle it

If finetuning with the Physics Probe loss showed no gain (or a drop) in Action Alignment Score and Flow-IMU correlation on AeroBench relative to the same finetuning without it, the claim would be falsified.
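
A correlation of this kind can be illustrated with a plain Pearson coefficient between a motion signal estimated from the generated frames and the commanded inertial signal on the same axis. Treating Flow-IMU correlation as a Pearson coefficient is an assumption made for this sketch; how the paper extracts the flow-based signal is not described on this page.

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences of numbers."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(vx * vy)
```

An ablation would compute this value for clips from the model trained with the probe loss and for clips from the model trained without it, using the same commanded actions for both.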

Figures

Figures reproduced from arXiv: 2605.19728 by Abdul Mohaimen Al Radi, Kunyang Li, Mubarak Shah, Yu Tian, Yuzhang Shang.

**Figure 2.** The proposed architecture. A pretrained diffusion backbone generates video latents

**Figure 3.** Physics Probe accuracy vs. baselines. Per-axis classification accuracy over K=7 discretized bins. Blue indicates the accuracy of choosing a bin uniformly at random; red indicates the accuracy of always choosing the majority bin. The Physics Probe substantially outperforms both the random and majority-bin baselines across all six axes.

**Figure 4.** Visual Fidelity Trade-off. While action-only finetuning achieves the lowest FVD, our physics-regularized model (Ours) maintains superior perceptual quality compared to base models and SOTA competitors like AirScape, without sacrificing structural similarity (SSIM).

**Figure 5.** Quantitative Benchmarking. Aero-World (Ours) improves mean action alignment and independent RGB-space Flow-IMU correlation, while maintaining low temporal instability compared with action-only finetuning

**Figure 6.** Qualitative action-controlled flight results. We show seven uniformly spaced frames from 81-frame rollouts. Aero-World produces stable, action-faithful motion in both unseen environments and validation-set maneuvers. Full prompts and videos are provided in the supplementary material.

Original abstract

Foundation video models produce visually impressive results, but their use in embodied AI remains limited because they are primarily trained on natural language rather than low-level control signals. This limitation is especially pronounced for aerial flight, where motion occurs in unconstrained 6-DoF space and small errors in ego-motion can produce large trajectory drift. Generating aerial videos that follow fine-grained inertial actions can support scalable training and evaluation of aerial agents by providing a controllable proxy for real-world or expensive simulation data. To address this problem, we propose Aero-World, a method for converting a pretrained image-to-video diffusion model into a controllable aerial video generator. Aero-World injects sequences of translational acceleration and angular velocity into a pretrained latent diffusion transformer through an action-token stream. A frozen latent-space Physics Probe, trained independently on real video-IMU pairs, provides differentiable inertial-consistency supervision during LoRA finetuning while avoiding computationally expensive video decoding. We further propose AeroBench, a benchmark for evaluating whether generated drone videos adhere to low-level action signals. AeroBench uses Action Alignment Score (AAS) to measure agreement with commanded inertial actions and Physical Consistency Rate (PCR) to measure temporal motion stability. On AeroBench, Aero-World improves mean AAS from 57.7 to 63.6 over action-only finetuning and gives a stronger quality-control trade-off than AirScape, with lower FVD (596.5 vs. 1058.6), higher SSIM (0.595 vs. 0.505), and higher Flow-IMU correlation (0.44 vs. 0.20). These results suggest that frozen Physics Probe supervision is a practical mechanism for adapting pretrained video generators toward more action-aligned aerial motion.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Aero-World adds action-token conditioning and a frozen latent Physics Probe to adapt video diffusion models for inertial-controlled aerial footage, with modest metric gains on a new benchmark but limited verification of the probe's reliability on generated outputs.

The core idea is to take a pretrained image-to-video diffusion model, feed it sequences of acceleration and angular velocity as action tokens, and use a frozen Physics Probe trained on real video-IMU pairs to add an inertial-consistency loss during LoRA finetuning. The setup aims to produce drone videos that better match low-level control signals without decoding full videos at every step.
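
For readers unfamiliar with LoRA, the standard low-rank update can be sketched in a few lines: the pretrained weight `W` stays frozen and only two small matrices `A` and `B` are trained, giving `h = W x + (alpha / r) * B (A x)`. The matrix shapes and helper names below are illustrative; which layers the authors adapt and which rank they use are not stated on this page.

```python
def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def lora_forward(W, A, B, x, alpha):
    """Compute h = W x + (alpha / r) * B (A x).

    W: d_out x d_in frozen pretrained weight
    A: r x d_in trainable down-projection
    B: d_out x r trainable up-projection (commonly initialized to zeros)
    """
    r = len(A)
    base = matvec(W, x)
    low_rank = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * l for b, l in zip(base, low_rank)]
```

With `B` initialized to zeros the adapted layer reproduces the pretrained output exactly, so finetuning starts from the behavior of the original model.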

Referee Report

2 major / 2 minor

Summary. The paper introduces Aero-World, a method to adapt a pretrained latent diffusion transformer (image-to-video model) into an action-conditioned aerial video generator. It injects sequences of translational acceleration and angular velocity via an action-token stream, uses LoRA finetuning, and employs a frozen latent-space Physics Probe (trained independently on real video-IMU pairs) to supply differentiable inertial-consistency supervision without decoding videos. The authors also propose the AeroBench benchmark, which evaluates generated videos using Action Alignment Score (AAS) for agreement with commanded inertial actions and Physical Consistency Rate (PCR) for temporal stability. Experiments report gains on AeroBench (AAS 57.7 to 63.6), lower FVD (596.5 vs. 1058.6), higher SSIM (0.595 vs. 0.505), and higher Flow-IMU correlation (0.44 vs. 0.20) compared to action-only finetuning and AirScape.

Significance. If the central results hold, the work offers a practical route for injecting low-level inertial control into foundation video models for aerial domains, which could aid scalable training and evaluation of embodied aerial agents. The frozen-probe supervision mechanism and the AeroBench benchmark are concrete contributions that address a gap between language-conditioned video generation and controllable 6-DoF motion synthesis.

major comments (2)

[§3.2–3.3] The claim that the frozen Physics Probe supplies reliable differentiable inertial-consistency supervision during LoRA finetuning rests on the unverified assumption that probe predictions remain accurate on the distribution of videos produced by the adapting generator. No ablation or error analysis is provided that measures probe accuracy (or gradient quality) on synthetic videos that differ in appearance statistics or ego-motion trajectories from the real video-IMU training pairs; the modest AAS improvement (57.7 → 63.6) could therefore arise from noisy or biased gradients rather than true inertial fidelity.
[§4.1 and Table 1] The experimental comparison with AirScape and the action-only baseline lacks reported error bars, dataset split details, and full hyperparameter specifications for the Physics Probe and LoRA stages. Without these, it is difficult to assess whether the reported gains on AAS, FVD, SSIM, and Flow-IMU correlation are robust or sensitive to implementation choices.

minor comments (2)

[§3] The notation for the action-token stream and the precise architecture of the Physics Probe (e.g., input dimensionality, latent-space projection) should be defined more explicitly with equations or a diagram to aid reproducibility.
[§4] AeroBench metric definitions (AAS and PCR) are introduced in §4 but would benefit from short pseudocode or an explicit formula in the main text rather than only in the supplement.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and outline the changes we will make to strengthen the manuscript.

Referee: [§3.2–3.3] The claim that the frozen Physics Probe supplies reliable differentiable inertial-consistency supervision during LoRA finetuning rests on the unverified assumption that probe predictions remain accurate on the distribution of videos produced by the adapting generator. No ablation or error analysis is provided that measures probe accuracy (or gradient quality) on synthetic videos that differ in appearance statistics or ego-motion trajectories from the real video-IMU training pairs; the modest AAS improvement (57.7 → 63.6) could therefore arise from noisy or biased gradients rather than true inertial fidelity.

Authors: We agree that an explicit validation of the Physics Probe on generated videos would provide stronger support for the supervision mechanism. The probe is trained on real video-IMU pairs and kept frozen precisely to preserve its learned motion priors, and the consistent gains across AAS, FVD, SSIM, and Flow-IMU correlation suggest the gradients are useful. Nevertheless, we did not quantify probe error or gradient quality on the adapting generator’s outputs. In the revised manuscript we will add an ablation that measures the probe’s prediction accuracy and the resulting gradient norms on a held-out set of videos sampled from the finetuned model.
revision: yes

Referee: [§4.1 and Table 1] The experimental comparison with AirScape and the action-only baseline lacks reported error bars, dataset split details, and full hyperparameter specifications for the Physics Probe and LoRA stages. Without these, it is difficult to assess whether the reported gains on AAS, FVD, SSIM, and Flow-IMU correlation are robust or sensitive to implementation choices.

Authors: We acknowledge that the current experimental section omits several details required for full reproducibility and robustness assessment. In the revised version we will report mean and standard deviation over at least three independent runs with different random seeds, explicitly describe the train/validation/test splits used for both AeroBench and the Physics Probe training data, and provide complete hyperparameter tables for the probe pre-training stage and the subsequent LoRA adaptation (including learning rates, rank, alpha, and training steps).
revision: yes
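
The seed-level reporting promised in this response could be produced with a small helper like the one below. It is a sketch using Python's `statistics` module; the function names are hypothetical and no values from the paper are assumed.

```python
import statistics

def summarize(runs):
    """Return (mean, sample standard deviation) of per-seed metric values."""
    if len(runs) < 2:
        raise ValueError("need at least two runs to estimate a standard deviation")
    return statistics.mean(runs), statistics.stdev(runs)

def format_metric(name, runs, digits=1):
    """Format a metric as 'name: mean ± std (n=runs)'."""
    mean, std = summarize(runs)
    return f"{name}: {mean:.{digits}f} ± {std:.{digits}f} (n={len(runs)})"
```

`statistics.stdev` is the sample (n-1) standard deviation, which is the usual choice when only a handful of seeds are available.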

Circularity Check

0 steps flagged

No significant circularity; supervision and evaluation are externally grounded

The derivation relies on a Physics Probe trained independently on real video-IMU pairs to supply inertial-consistency loss during LoRA adaptation of a pretrained diffusion model. AeroBench evaluation metrics (AAS, PCR) and reported gains (e.g., AAS 57.7→63.6) are measured on held-out generated videos against commanded actions, not by construction from the same fitted quantities. No self-definitional reduction, fitted-input-as-prediction, or load-bearing self-citation chain appears in the provided derivation; the central claim remains falsifiable against external real IMU data and the new benchmark.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 2 invented entities

The central claim rests on the effectiveness of the Physics Probe and the assumption that LoRA adaptation preserves the benefits of the pretrained model while incorporating inertial signals.

axioms (1)

domain assumption: Pretrained image-to-video latent diffusion transformer can be effectively adapted via LoRA while preserving visual quality
Invoked when describing the conversion of the pretrained model into a controllable generator.

invented entities (2)

Physics Probe (no independent evidence)
purpose: Provide differentiable inertial-consistency supervision in latent space from real video-IMU pairs
Introduced as a frozen component trained independently on real data to avoid expensive video decoding.

AeroBench (no independent evidence)
purpose: Benchmark for measuring action alignment and physical consistency of generated aerial videos
Proposed in the paper with AAS and PCR metrics.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: A frozen latent-space Physics Probe, trained independently on real video–IMU pairs, provides differentiable inertial-consistency supervision during LoRA finetuning

IndisputableMonolith/Foundation/DimensionForcing.lean · alexander_duality_circle_linking · unclear
Paper passage: AAS and PCR metrics on AeroBench for 6-DoF action alignment

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.


[1] [1]

CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer

Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, et al. Cogvideox: Text-to-video diffusion models with an expert transformer.arXiv preprint arXiv:2408.06072,

work page internal anchor Pith review Pith/arXiv arXiv

[2] [2]

LTX-Video: Realtime Video Latent Diffusion

Yoav HaCohen, Nisan Chiprut, Benny Brazowski, Daniel Shalem, Dudu Moshe, Eitan Richardson, Eran Levin, Guy Shiran, Nir Zabari, Ori Gordon, et al. Ltx-video: Realtime video latent diffusion. arXiv preprint arXiv:2501.00103,

work page internal anchor Pith review Pith/arXiv arXiv

[3] [3]

HunyuanVideo: A Systematic Framework For Large Video Generative Models

Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, et al. Hunyuanvideo: A systematic framework for large video generative models.arXiv preprint arXiv:2412.03603,

work page internal anchor Pith review Pith/arXiv arXiv

[4] [4]

Open-Sora: Democratizing Efficient Video Production for All

Zangwei Zheng, Xiangyu Peng, Tianji Yang, Chenhui Shen, Shenggui Li, Hongxin Liu, Yukun Zhou, Tianyi Li, and Yang You. Open-sora: Democratizing efficient video production for all.arXiv preprint arXiv:2412.20404,

work page internal anchor Pith review Pith/arXiv arXiv

[5] [5]

Wan: Open and Advanced Large-Scale Video Generative Models

Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314,

work page internal anchor Pith review Pith/arXiv arXiv

[6] [6]

Drivinggen: A comprehensive benchmark for generative video world models in autonomous driving.arXiv preprint arXiv:2601.01528,

Yang Zhou, Hao Shao, Letian Wang, Zhuofan Zong, Hongsheng Li, and Steven L Waslander. Drivinggen: A comprehensive benchmark for generative video world models in autonomous driving.arXiv preprint arXiv:2601.01528,

work page arXiv

[7] [7]

VBench-2.0: Advancing Video Generation Benchmark Suite for Intrinsic Faithfulness

Dian Zheng, Ziqi Huang, Hongbo Liu, Kai Zou, Yinan He, Fan Zhang, Lulu Gu, Yuanhan Zhang, Jingwen He, Wei-Shi Zheng, et al. Vbench-2.0: Advancing video generation benchmark suite for intrinsic faithfulness.arXiv preprint arXiv:2503.21755,

work page internal anchor Pith review Pith/arXiv arXiv

[8] [8]

GAIA-1: A Generative World Model for Autonomous Driving

Anthony Hu, Lloyd Russell, Hudson Yeo, Zak Murez, George Fedoseev, Alex Kendall, Jamie Shotton, and Gianluca Corrado. Gaia-1: A generative world model for autonomous driving.arXiv preprint arXiv:2309.17080,

work page internal anchor Pith review Pith/arXiv arXiv

[9] [9]

Cosmos World Foundation Model Platform for Physical AI

Niket Agarwal, Arslan Ali, Maciej Bala, Yogesh Balaji, Erik Barker, Tiffany Cai, Prithvijit Chat- topadhyay, Yongxin Chen, Yin Cui, Yifan Ding, et al. Cosmos world foundation model platform for physical ai.arXiv preprint arXiv:2501.03575,

work page internal anchor Pith review Pith/arXiv arXiv

[10] [10]

Mastering Diverse Domains through World Models

Danijar Hafner, Jurgis Pasukonis, Jimmy Ba, and Timothy Lillicrap. Mastering diverse domains through world models.arXiv preprint arXiv:2301.04104,

work page internal anchor Pith review Pith/arXiv arXiv

[11] [11]

Context as memory: Scene-consistent interactive long video generation with memory retrieval

Jiwen Yu, Jianhong Bai, Yiran Qin, Quande Liu, Xintao Wang, Pengfei Wan, Di Zhang, and Xihui Liu. Context as memory: Scene-consistent interactive long video generation with memory retrieval. InProceedings of the SIGGRAPH Asia 2025 Conference Papers, pages 1–11,

work page 2025

[12]

Virtually being: Customizing camera-controllable video diffusion models with multi-view performance captures

Yuancheng Xu, Wenqi Xian, Li Ma, Julien Philip, Ahmet Levent Taşel, Yiwei Zhao, Ryan Burgert, Mingming He, Oliver Hermann, Oliver Pilarski, et al. Virtually being: Customizing camera-controllable video diffusion models with multi-view performance captures. arXiv preprint arXiv:2510.14179.

[13]

Egosim: Egocentric world simulator for embodied interaction generation

Jinkun Hao, Mingda Jia, Ruiyan Wang, Xihui Liu, Ran Yi, Lizhuang Ma, Jiangmiao Pang, and Xudong Xu. Egosim: Egocentric world simulator for embodied interaction generation. arXiv preprint arXiv:2604.01001, 2026.

[14]

WorldMark: A Unified Benchmark Suite for Interactive Video World Models

Xiaojie Xu, Zhengyuan Lin, Kang He, Yukang Feng, Xiaofeng Mao, Yuanyang Yin, Kaipeng Zhang, and Yongtao Ge. WorldMark: A unified benchmark suite for interactive video world models. arXiv preprint arXiv:2604.21686.

[15]

Imagine2act: Leveraging object-action motion consistency from imagined goals for robotic manipulation

Liang Heng, Jiadong Xu, Yiwen Wang, Xiaoqi Li, Muhe Cai, Yan Shen, Juan Zhu, Guanghui Ren, and Hao Dong. Imagine2act: Leveraging object-action motion consistency from imagined goals for robotic manipulation. arXiv preprint arXiv:2509.17125.

[16]

How Far is Video Generation from World Model: A Physical Law Perspective

Bingyi Kang, Yang Yue, Rui Lu, Zhijie Lin, Yang Zhao, Kaixin Wang, Gao Huang, and Jiashi Feng. How far is video generation from world model: A physical law perspective. arXiv preprint arXiv:2411.02385.

[17]

Do generative video models understand physical principles?

Saman Motamed, Laura Culp, Kevin Swersky, Priyank Jaini, and Robert Geirhos. Do generative video models understand physical principles? arXiv preprint arXiv:2501.09038.

[18]

InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation

Yi Wang, Yinan He, Yizhuo Li, Kunchang Li, Jiashuo Yu, Xin Ma, Xinhao Li, Guo Chen, Xinyuan Chen, Yaohui Wang, et al. InternVid: A large-scale video-text dataset for multimodal understanding and generation. arXiv preprint arXiv:2307.06942.

[19]

Jeffrey Delmerico, Titus Cieslewski, Henri Rebecq, Matthias Faessler, and Davide Scaramuzza. Are we ready for autonomous drone racing? The UZH-FPV drone racing dataset. In IEEE Int. Conf. Robot. Autom. (ICRA).

[20]

Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127.

[21]

Latte: Latent Diffusion Transformer for Video Generation

Xin Ma, Yaohui Wang, Xinyuan Chen, Gengyun Jia, Ziwei Liu, Yuan-Fang Li, Cunjian Chen, and Yu Qiao. Latte: Latent diffusion transformer for video generation. arXiv preprint arXiv:2401.03048.

[22]

Phenaki: Variable Length Video Generation From Open Domain Textual Description

Ruben Villegas, Mohammad Babaeizadeh, Pieter-Jan Kindermans, Hernan Moraldo, Han Zhang, Mohammad Taghi Saffar, Santiago Castro, Julius Kunze, and Dumitru Erhan. Phenaki: Variable length video generation from open domain textual description. arXiv preprint arXiv:2210.02399.

[23]

Make-A-Video: Text-to-Video Generation without Text-Video Data

Uriel Singer, Adam Polyak, Thomas Hayes, Xi Yin, Jie An, Songyang Zhang, Qiyuan Hu, Harry Yang, Oron Ashual, Oran Gafni, et al. Make-A-Video: Text-to-video generation without text-video data. arXiv preprint arXiv:2209.14792.

[24]

DragNUWA: Fine-grained control in video generation by integrating text, image, and trajectory

Shengming Yin, Chenfei Wu, Jian Liang, Jie Shi, Houqiang Li, Gong Ming, and Nan Duan. DragNUWA: Fine-grained control in video generation by integrating text, image, and trajectory. arXiv preprint arXiv:2308.08089.

[25]

OpenVLA: An open-source vision-language-action model

Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, et al. OpenVLA: An open-source vision-language-action model. arXiv preprint arXiv:2406.09246.

[26]

PaLM-E: An Embodied Multimodal Language Model

Danny Driess, Fei Xia, Mehdi SM Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, et al. PaLM-E: An embodied multimodal language model. arXiv preprint arXiv:2303.03378.

[27]

A Generalist Agent

Scott Reed, Konrad Zolna, Emilio Parisotto, Sergio Gomez Colmenarejo, Alexander Novikov, Gabriel Barth-Maron, Mai Gimenez, Yury Sulsky, Jackie Kay, Jost Tobias Springenberg, et al. A generalist agent. arXiv preprint arXiv:2205.06175.

[28]

DroneVLA: VLA based aerial manipulation

Fawad Mehboob, Monijesu James, Amir Habel, Jeffrin Sam, Miguel Altamirano Cabrera, and Dzmitry Tsetserukou. DroneVLA: VLA based aerial manipulation. arXiv preprint arXiv:2601.13809.

[29]

RaceVLA: VLA-based racing drone navigation with human-like behaviour

Valerii Serpiva, Artem Lykov, Artyom Myshlyaev, Muhammad Haris Khan, Ali Alridha Abdulkarim, Oleg Sautenkov, and Dzmitry Tsetserukou. RaceVLA: VLA-based racing drone navigation with human-like behaviour. arXiv preprint arXiv:2503.02572.