Agentic Video Generation: From Text to Executable Event Graphs via Tool-Constrained LLM Planning
Pith reviewed 2026-05-10 15:18 UTC · model grok-4.3
The pith
LLM agents construct executable event graphs that a 3D game engine runs to create videos from text.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that separating the LLM's narrative reasoning from a programmatic backend that validates all simulator constraints produces GEST specifications that are executable by construction. In seeded comparisons with human annotation, the resulting videos reach 58 percent physical validity and 3.75 out of 5 semantic alignment, against 25 and 20 percent validity and 2.33 and 1.50 alignment for the two neural baselines.
What carries the argument
The Graph of Events in Space and Time (GEST), a structured formal specification of actors, actions, objects, and temporal constraints that is populated by a hierarchical Director and Scene Builder agent pair using validated tool calls to a state backend.
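To make the shape of such a specification concrete, here is a minimal sketch of an event graph with typed edges and a consistency check on temporal ordering. It is an illustration only: the class names `Event` and `EventGraph`, the field names, and the single `"before"` relation are assumptions chosen for this example, not the GEST schema used by the paper.

```python
from dataclasses import dataclass, field
from graphlib import CycleError, TopologicalSorter
from typing import Optional


@dataclass(frozen=True)
class Event:
    """One node: an actor performs an action, optionally on an object."""
    event_id: str
    actor: str
    action: str
    target: Optional[str] = None


@dataclass
class EventGraph:
    """Hypothetical GEST-like container: events plus typed edges."""
    events: dict = field(default_factory=dict)
    # Each edge is (source_id, relation, destination_id).
    edges: list = field(default_factory=list)

    def add_event(self, event: Event) -> None:
        self.events[event.event_id] = event

    def add_edge(self, src: str, relation: str, dst: str) -> None:
        # Reject edges that point at events which do not exist.
        for event_id in (src, dst):
            if event_id not in self.events:
                raise KeyError(f"unknown event: {event_id}")
        self.edges.append((src, relation, dst))

    def temporal_order(self) -> list:
        """Return an event order consistent with every 'before' edge.

        Raises ValueError if the 'before' edges contradict each other.
        """
        sorter = TopologicalSorter({eid: set() for eid in self.events})
        for src, relation, dst in self.edges:
            if relation == "before":
                sorter.add(dst, src)  # dst must come after src
        try:
            return list(sorter.static_order())
        except CycleError as exc:
            raise ValueError("temporal constraints are contradictory") from exc
```

A graph in which `e1` is marked as before `e2` yields an order with `e1` first; adding the reverse edge makes `temporal_order` raise, which is the kind of inconsistency a structured specification can detect before anything is rendered.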
If this is right
- Agentic outputs win 79 percent of text comparisons and 74 percent of video comparisons against procedural baselines in LLM jury evaluations.
- Staged LLM refinement alone produces zero executable specifications in fifty attempts, showing the need for tool-enforced state management.
- Relation Subagents are required to fill logical and semantic edges that procedural methods leave empty, exercising the full capacity of the GEST representation.
- Validated tool calls guarantee that every specification is executable, and deterministic engine execution removes the semantic unreliability seen in direct pixel generation systems.
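The idea of tool-enforced state management in the points above can be sketched as a backend that checks every proposed call before mutating state, so that whatever ends up in the state is valid by construction. This is a toy under stated assumptions: `SceneState`, `ToolError`, `apply_tool_calls`, the three tool names, and the capability table are all hypothetical and stand in for whatever constraints the real simulator imposes.

```python
class ToolError(Exception):
    """Raised when a proposed tool call violates a constraint."""


class SceneState:
    """Hypothetical state backend that only accepts valid mutations."""

    def __init__(self, capabilities):
        # Assumed constraint table: object type -> set of allowed actions.
        self.capabilities = capabilities
        self.actors = set()
        self.objects = {}   # object name -> object type
        self.events = []    # accepted (actor, action, target) triples

    def add_actor(self, name):
        if name in self.actors:
            raise ToolError(f"actor {name!r} already exists")
        self.actors.add(name)

    def add_object(self, name, kind):
        if kind not in self.capabilities:
            raise ToolError(f"object type {kind!r} is not supported")
        self.objects[name] = kind

    def add_action(self, actor, action, target):
        if actor not in self.actors:
            raise ToolError(f"unknown actor {actor!r}")
        if target not in self.objects:
            raise ToolError(f"unknown object {target!r}")
        kind = self.objects[target]
        if action not in self.capabilities[kind]:
            raise ToolError(f"{kind!r} does not support {action!r}")
        self.events.append((actor, action, target))


def apply_tool_calls(state, calls):
    """Apply (tool_name, kwargs) pairs; return messages for rejected calls.

    A rejected call leaves the state untouched, so the planner can read
    the message and try again.
    """
    tools = {
        "add_actor": state.add_actor,
        "add_object": state.add_object,
        "add_action": state.add_action,
    }
    rejections = []
    for name, kwargs in calls:
        try:
            if name not in tools:
                raise ToolError("no such tool")
            tools[name](**kwargs)
        except ToolError as exc:
            rejections.append(f"{name}: {exc}")
    return rejections
```

In this design the planner never writes the specification directly. It can only request changes, and an invalid request produces feedback rather than a broken specification, which is one plausible reading of why free-form staged refinement fails where tool calls succeed.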
Where Pith is reading between the lines
- The GEST could serve as explicit ground-truth labels to train or fine-tune neural video models on physically consistent data.
- Users could edit the event graph directly to steer generation without re-prompting the entire system.
- The separation of planning and execution might extend to other domains that require both narrative flexibility and hard constraint satisfaction, such as robotic task planning.
Load-bearing premise
The 3D game engine can run the full range of complex interactions described in the GEST without adding simulation artifacts that would cancel the physical validity gains.
What would settle it
Generation of a multi-agent scene with precise object interactions where the engine output visibly violates the stated temporal or physical constraints in the GEST, such as incorrect collision responses or missing causal links.
Original abstract
Existing multi-agent video generation systems use LLM agents to orchestrate neural video generators, producing visually impressive but semantically unreliable outputs with no ground truth annotations. We present an agentic system that inverts this paradigm: instead of generating pixels, the LLM constructs a formal Graph of Events in Space and Time (GEST) -- a structured specification of actors, actions, objects, and temporal constraints -- which is then executed deterministically in a 3D game engine. A staged LLM refinement pipeline fails entirely at this task (0 of 50 attempts produce an executable specification), motivating a fundamentally different architecture based on a separation of concerns: the LLM handles narrative planning through natural language reasoning, while a programmatic state backend enforces all simulator constraints through validated tool calls, guaranteeing that every generated specification is executable by construction. The system uses a hierarchical two-agent architecture -- a Director that plans the story and a Scene Builder that constructs individual scenes through a round-based state machine -- with dedicated Relation Subagents that populate the logical and semantic edge types of the GEST formalism that procedural generation leaves empty, making this the first approach to exercise the full expressive capacity of the representation. We evaluate in two stages: autonomous generation against procedural baselines via a 3-model LLM jury, where agentic narratives win 79% of text and 74% of video comparisons; and seeded generation where the same text is given to our system, VEO 3.1, and WAN 2.2, with human annotations showing engine-generated videos substantially outperform neural generators on physical validity (58% vs 25% and 20%) and semantic alignment (3.75/5 vs 2.33 and 1.50).
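For readers unfamiliar with jury-style evaluation, the following sketch shows one way pairwise win rates could be computed from several judges. The aggregation rule is an assumption: this page does not say how the three jury models' votes are combined, so simple majority voting is used here, and the function names and labels are hypothetical.

```python
from collections import Counter


def majority_winner(votes):
    """Return the label with the most votes, or "tie" if the top two are equal.

    `votes` holds one label per judge, e.g. ["agentic", "procedural", "agentic"].
    """
    ranked = Counter(votes).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return "tie"
    return ranked[0][0]


def win_rate(comparisons, system="agentic"):
    """Fraction of comparisons whose majority winner is `system`."""
    winners = [majority_winner(votes) for votes in comparisons]
    return sum(winner == system for winner in winners) / len(winners)
```

With three judges and two candidate labels a tie can only occur if a judge abstains or returns a third label, which is why the sketch handles that case explicitly instead of assuming it away.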
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces an agentic video generation framework that uses LLMs to construct a formal Graph of Events in Space and Time (GEST) specification of actors, actions, objects, and temporal constraints, which is then executed deterministically in a 3D game engine rather than synthesizing pixels directly. It describes a hierarchical architecture with a Director for narrative planning, a Scene Builder using a round-based state machine, and dedicated Relation Subagents to populate logical and semantic relations. The system is motivated by the failure of staged LLM refinement (0/50 executable outputs) and evaluated in two stages: autonomous generation against procedural baselines via a 3-model LLM jury (79% text and 74% video wins) and seeded generation against VEO 3.1 and WAN 2.2, where human annotations show engine outputs scoring 58% physical validity (vs. 25% and 20%) and 3.75/5 semantic alignment (vs. 2.33 and 1.50).
Significance. If the empirical comparisons hold under rigorous scrutiny, the work offers a substantive alternative to neural video generators by enforcing executability and physical constraints through symbolic planning and simulation. The separation of LLM-based narrative reasoning from programmatic constraint enforcement, along with the use of a full GEST representation, provides a falsifiable and reproducible pathway for complex scene generation that could improve reliability in applications requiring semantic and physical consistency.
major comments (3)
- [Evaluation] Evaluation section (human annotation results): The central claim of substantial outperformance on physical validity (58% vs. 25%/20%) and semantic alignment (3.75/5 vs. 2.33/1.50) is load-bearing, yet the manuscript provides no sample size, inter-annotator agreement, definition of physical validity criteria, or controls for prompt engineering and annotator bias. This directly affects the strength of support for the seeded-generation comparison to VEO 3.1 and WAN 2.2.
- [Method] Method section (Scene Builder and Relation Subagents): The architecture guarantees syntactic executability via tool calls and the programmatic state backend, but contains no quantitative audit or coverage analysis of simulation artifacts (e.g., collision resolution, secondary effects, or fidelity loss in multi-actor scenes) that the 3D engine may introduce when executing complex GEST specifications. Such artifacts would be invisible to the 'executable by construction' guarantee yet would undermine the reported physical-validity advantage.
- [Abstract and Evaluation] Abstract and Evaluation: The staged LLM refinement baseline is reported as failing in all 50 attempts (0 of 50 produce an executable specification), but the manuscript does not detail the exact prompting strategy, failure modes, or how this baseline was constructed, making it difficult to assess whether the proposed hierarchical architecture's advantages are fairly isolated from prompt-engineering effects.
minor comments (2)
- [Introduction] The GEST formalism is referenced throughout but lacks an early formal definition, diagram, or edge-type enumeration that would clarify how the Relation Subagents populate the representation.
- [Evaluation] Figure captions and table headers could more explicitly link reported percentages to the exact comparison conditions (autonomous vs. seeded) to avoid reader confusion.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. The comments identify key areas where additional transparency and analysis will strengthen the manuscript. We address each major comment below and commit to revisions that improve the rigor of the evaluation and method sections without altering the core claims.
- Referee: [Evaluation] Evaluation section (human annotation results): The central claim of substantial outperformance on physical validity (58% vs. 25%/20%) and semantic alignment (3.75/5 vs. 2.33/1.50) is load-bearing, yet the manuscript provides no sample size, inter-annotator agreement, definition of physical validity criteria, or controls for prompt engineering and annotator bias. This directly affects the strength of support for the seeded-generation comparison to VEO 3.1 and WAN 2.2.
  Authors: We agree that these details are necessary to fully support the reported results. In the revised manuscript we will expand the Evaluation section with the sample size used for human annotations, inter-annotator agreement statistics, the explicit criteria employed for physical validity judgments, and a description of controls including randomized video presentation order and blinded annotation procedures. These additions will be placed in a dedicated subsection on the human study protocol.
  revision: yes
- Referee: [Method] Method section (Scene Builder and Relation Subagents): The architecture guarantees syntactic executability via tool calls and the programmatic state backend, but contains no quantitative audit or coverage analysis of simulation artifacts (e.g., collision resolution, secondary effects, or fidelity loss in multi-actor scenes) that the 3D engine may introduce when executing complex GEST specifications. Such artifacts would be invisible to the 'executable by construction' guarantee yet would undermine the reported physical-validity advantage.
  Authors: This observation is correct; executability does not automatically preclude simulation-level artifacts. We will add a quantitative audit subsection to the Method section that reports coverage statistics on collision resolutions, secondary physics effects, and any observed fidelity loss across the evaluated multi-actor scenes. This analysis will be tied directly to the human physical-validity annotations to clarify the contribution of the GEST representation versus engine behavior.
  revision: yes
- Referee: [Abstract and Evaluation] Abstract and Evaluation: The staged LLM refinement baseline is reported as failing in all 50 attempts (0 of 50 produce an executable specification), but the manuscript does not detail the exact prompting strategy, failure modes, or how this baseline was constructed, making it difficult to assess whether the proposed hierarchical architecture's advantages are fairly isolated from prompt-engineering effects.
  Authors: We acknowledge that greater detail on the baseline construction is required for fair comparison. The revised manuscript will expand both the Abstract and Evaluation sections to describe the precise prompting strategy used for staged refinement, a categorization of the observed failure modes across the 50 attempts, and the sampling procedure that aligned the baseline prompts with those used for the hierarchical system. This will help isolate the benefits of the Director-Scene Builder separation from prompt-engineering variations.
  revision: yes
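The first exchange above turns on sample size and inter-annotator agreement. As a rough illustration of why both matter, the sketch below computes Cohen's kappa for two annotators and a Wilson score interval for a proportion. These are standard statistics, not the paper's protocol; the function names are chosen for this example, and any counts passed in are hypothetical because the page reports no sample size.

```python
import math


def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators on the same items."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    categories = set(labels_a) | set(labels_b)
    # Agreement expected if each annotator labelled independently
    # according to their own label frequencies.
    expected = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n) for c in categories
    )
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half_width, centre + half_width
```

Calling `wilson_interval(7, 12)` and `wilson_interval(70, 120)` gives intervals around the same proportion of about 58 percent, but the first is much wider. That difference is why a validity percentage is hard to interpret without the number of annotated videos behind it.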
Circularity Check
No significant circularity; claims rest on external empirical comparisons
The paper's load-bearing results derive from two independent evaluation stages: an LLM jury comparing agentic outputs to procedural baselines (79%/74% win rates) and human annotations on seeded generation against VEO 3.1 and WAN 2.2 (58% vs 25%/20% physical validity; 3.75 vs 2.33/1.50 semantic alignment). These metrics are measured externally and do not reduce to quantities defined inside the system. The 'executable by construction' property follows directly from the programmatic state backend and validated tool calls, which is an architectural design choice rather than a self-referential definition of the performance claims. No equations, fitted parameters, or self-citation chains are invoked to derive the reported advantages. The derivation chain is therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
invented entities (1)
- GEST (Graph of Events in Space and Time): no independent evidence
Forward citations
Cited by 1 Pith paper
- NEWTON: Agentic Planning for Physically Grounded Video Generation. NEWTON improves physical accuracy in video generation by deploying a trainable planner that coordinates physics-aware tools and a verifier, raising joint accuracy on VideoPhy-2 without altering the base generators.