pith. sign in

arxiv: 2607.01766 · v1 · pith:HR4MUKAQnew · submitted 2026-07-02 · 💻 cs.AI

SimWorlds: A Multi-Agent System for Dynamic 3D Scene Creation

Pith reviewed 2026-07-03 13:56 UTC · model grok-4.3

classification 💻 cs.AI
keywords multi-agent systemstext-to-3D generationdynamic scene synthesisprocedural modelingphysics simulationBlender scripting4D content creationbenchmark evaluation
0
0 comments X

The pith

SimWorlds produces dynamic editable 4D scenes from text via a multi-agent planner-coder-reviewer workflow in Blender.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes a multi-agent system that translates natural language into procedural dynamic scenes containing flowing liquids, emitting particles, cascading rigid bodies, and moving mechanisms. Prior text-to-scene methods handle only static outputs, but dynamic cases require simultaneous coordination of layout, multiple physics solvers, timing, cameras, and lighting. SimWorlds supplies Blender-specific knowledge, an ordered construction sequence, a deterministic scene verifier, and runtime inspection tools that detect failures invisible in rendered video. The authors introduce 4DBuildBench to measure both visual and physical quality. Experiments show the system exceeds earlier dynamic Blender baselines.

Core claim

SimWorlds is a multi-agent framework that produces dynamic, editable 4D scenes from text, with Blender-specific procedural knowledge, a planner-coder-reviewer workflow driving a fixed ordered sequence of construction stages, a layered scene protocol enforced by a deterministic verifier, and a runtime-state inspection tool suite that catches mechanism failures the rendered image cannot reveal.

What carries the argument

Planner-coder-reviewer workflow that drives a fixed ordered sequence of construction stages, combined with a layered scene protocol and runtime-state inspection tools.

If this is right

  • Generated scenes serve directly as editable training data for video generation models.
  • Physical consistency can be verified without relying solely on rendered video frames.
  • The same workflow structure extends to other procedural modeling environments beyond Blender.
  • Benchmarks like 4DBuildBench become necessary to separate visual appearance from motion correctness.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Integration with reinforcement-learning environments could supply infinite varied physics trajectories for policy training.
  • The deterministic verifier layer could be reused as a general filter for any LLM-generated simulation code.
  • Extending the construction stages to include sound or material fracture would require only additions to the protocol without changing the agent roles.

Load-bearing premise

The planner-coder-reviewer workflow, deterministic verifier, and runtime inspection tools can reliably coordinate spatial layout, multiple physics solvers, temporal sequencing, camera, and lighting into physically consistent dynamic scenes.

What would settle it

A generated scene that passes all verifier checks and runtime inspections yet shows objects intersecting, incorrect particle trajectories, or timing mismatches when simulated forward from the initial state.

Figures

Figures reproduced from arXiv: 2607.01766 by Chunjiang Liu, Haoyu Chen, L\'aszl\'o A. Jeni, Ming-Hsuan Yang, Xiaoyuan Wang, Yizhou Zhao.

Figure 1
Figure 1. Figure 1: SimWorlds turns text into dynamic, editable 3D Blender scenes. Given a natural￾language prompt, a planner, coder, and reviewer cooperate to emit a .blend whose geometry, materials, lighting, camera, and motion all remain controllable for downstream editing and reuse. Abstract LLM agents are increasingly used to translate natural language into 3D scenes in a procedural way, but existing systems focus on sta… view at source ↗
Figure 2
Figure 2. Figure 2: Pipeline overview. The planner compiles the prompt into a single scene plan; construction then proceeds through a fixed stage sequence, each stage running the coder, a deterministic verifier (scene protocol, §3.2), and a reviewer, with failed checks triggering bounded local retries. Once all stages close, the scene is rendered as a frame-sequence video. image, and the scene is assembled through a fixed, ch… view at source ↗
Figure 3
Figure 3. Figure 3: Qualitative comparison across four dynamic sequences: (a) a domino cascade in a child’s [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗
read the original abstract

LLM agents are increasingly used to translate natural language into 3D scenes in a procedural way, but existing systems focus on static output. Dynamic 4D scenes from text alone, in which liquids flow, particles emit, rigid bodies cascade, and articulated mechanisms move, remain largely unexplored despite their value as editable content and as physics-grounded training data for video generation and embodied AI. Two challenges set the dynamic case apart from static text-to-scene work: an agent must jointly coordinate spatial layout, multiple physics solvers, temporal sequencing, camera, and lighting in a single coherent scene, and verifying motion correctness from rendered video is fundamentally harder than judging a single image. We present SimWorlds: a multi-agent framework that produces dynamic, editable 4D scenes from text, with Blender-specific procedural knowledge, a planner-coder-reviewer workflow driving a fixed ordered sequence of construction stages, a layered scene protocol enforced by a deterministic verifier, and a runtime-state inspection tool suite that catches mechanism failures the rendered image cannot reveal. We also introduce 4DBuildBench, a benchmark for assessing both visual fidelity and physical consistency of the procedural dynamic 3D scenes generated from text prompts. Experiments show that SimWorlds outperforms prior dynamic Blender generation baselines.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces SimWorlds, a multi-agent LLM framework that generates dynamic, editable 4D scenes in Blender from text prompts. It incorporates Blender-specific procedural knowledge, a planner-coder-reviewer workflow that follows a fixed ordered sequence of construction stages, a layered scene protocol with a deterministic verifier, and a suite of runtime-state inspection tools to detect failures not visible in rendered video. The work also presents the 4DBuildBench benchmark for evaluating visual fidelity and physical consistency, and reports that SimWorlds outperforms prior dynamic Blender generation baselines.

Significance. If the coordination mechanisms prove reliable, the framework would advance text-to-dynamic-scene generation by addressing the joint challenges of spatial layout, multiple physics solvers, temporal sequencing, camera, and lighting. The introduction of 4DBuildBench as an independent benchmark and the runtime-state inspection tools represent concrete contributions that could support reproducible evaluation and physics-grounded data generation for video models and embodied AI.

major comments (2)
  1. [Abstract, §3] Abstract and §3 (planner-coder-reviewer workflow): the fixed ordered sequence of construction stages provides no described mechanism for revising an early stage (e.g., initial rigid-body placement) once a later stage reveals an inconsistency with fluid or particle dynamics. This directly bears on the central claim that the system reliably coordinates multiple physics solvers and temporal sequencing, as the verifier can only accept or reject the current stage output.
  2. [§4, 4DBuildBench] §4 (experiments) and 4DBuildBench description: the abstract asserts outperformance on the new benchmark, yet the provided text supplies no quantitative results, error bars, or per-category breakdown of visual fidelity versus physical consistency scores; without these, the claim that the runtime-state tools solve the motion-verification challenge cannot be assessed.
minor comments (2)
  1. [§3] Notation for the layered scene protocol is introduced without a compact tabular summary of layer dependencies; a small table would improve clarity.
  2. [Abstract, §4] The benchmark prompt set size and diversity statistics are not stated explicitly in the abstract or early sections.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the workflow design and the presentation of experimental results. We address each major comment point by point below.

read point-by-point responses
  1. Referee: [Abstract, §3] Abstract and §3 (planner-coder-reviewer workflow): the fixed ordered sequence of construction stages provides no described mechanism for revising an early stage (e.g., initial rigid-body placement) once a later stage reveals an inconsistency with fluid or particle dynamics. This directly bears on the central claim that the system reliably coordinates multiple physics solvers and temporal sequencing, as the verifier can only accept or reject the current stage output.

    Authors: The workflow intentionally uses a fixed ordered sequence of stages to enforce systematic construction, with the layered scene protocol and deterministic verifier ensuring consistency checks at each step. The runtime-state inspection tools are designed to surface issues not visible in video renders. We agree, however, that the current description lacks an explicit mechanism for revising decisions from early stages (such as rigid-body placement) when later stages expose inconsistencies with fluid or particle dynamics. In the revision we will add a dedicated paragraph clarifying this design limitation, its impact on the coordination claim, and how the inspection tools partially mitigate propagation of errors; we will also outline a possible extension for limited cross-stage refinement. revision: partial

  2. Referee: [§4, 4DBuildBench] §4 (experiments) and 4DBuildBench description: the abstract asserts outperformance on the new benchmark, yet the provided text supplies no quantitative results, error bars, or per-category breakdown of visual fidelity versus physical consistency scores; without these, the claim that the runtime-state tools solve the motion-verification challenge cannot be assessed.

    Authors: Section 4 presents quantitative comparisons on 4DBuildBench, but we acknowledge that error bars, per-category breakdowns separating visual fidelity from physical consistency, and explicit linkage to the runtime-state tools are not sufficiently prominent. In the revised manuscript we will add a summary table with these details, include error bars on all reported metrics, and provide a per-category analysis to directly support the abstract claim and demonstrate the contribution of the inspection tools. revision: yes

Circularity Check

0 steps flagged

No circularity detected in system description or benchmark

full rationale

The paper presents an engineering system (multi-agent planner-coder-reviewer workflow, layered protocol, deterministic verifier, runtime inspection tools) built on external Blender, plus an independently introduced benchmark 4DBuildBench. No equations, fitted parameters, self-definitional claims, or load-bearing self-citations appear in the provided text; performance is asserted via comparison to external baselines rather than internal tautologies. The fixed-sequence design noted by the skeptic is a potential limitation but does not constitute circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no concrete information on free parameters, axioms, or invented entities; all claims remain at the level of system description.

pith-pipeline@v0.9.1-grok · 5776 in / 1229 out tokens · 34813 ms · 2026-07-03T13:56:34.331493+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

84 extracted references · 17 canonical work pages · 8 internal anchors

  1. [1]

    DreamFusion: Text-to-3d using 2d diffusion

    Ben Poole, Ajay Jain, Jonathan T Barron, and Ben Mildenhall. DreamFusion: Text-to-3d using 2d diffusion. InICLR, 2023. 2, 3

  2. [2]

    LRM: Large reconstruction model for single image to 3d

    Yicong Hong, Kai Zhang, Jiuxiang Gu, Sai Bi, Yang Zhou, Difan Liu, Feng Liu, Kalyan Sunkavalli, Trung Bui, and Hao Tan. LRM: Large reconstruction model for single image to 3d. InICLR, 2024. 3

  3. [3]

    LGM: Large multi-view Gaussian model for high-resolution 3d content creation

    Jiaxiang Tang, Zhaoxi Chen, Xiaokang Chen, Tengfei Wang, Gang Zeng, and Ziwei Liu. LGM: Large multi-view Gaussian model for high-resolution 3d content creation. InECCV, 2024. 2, 3

  4. [4]

    Ross, Cordelia Schmid, and Alireza Fathi

    Ziniu Hu, Ahmet Iscen, Aashi Jain, Thomas Kipf, Yisong Yue, David A. Ross, Cordelia Schmid, and Alireza Fathi. SceneCraft: An LLM agent for synthesizing 3d scene as Blender code. arXiv:2403.01248, 2024. 2, 3, 6, 7

  5. [5]

    arXiv preprint arXiv:2310.12945 , year=

    Chunyi Sun et al. 3D-GPT: Procedural 3d modeling with large language models. arXiv:2310.12945, 2023. 3

  6. [6]

    LL3M: Large language 3D modelers.arXiv:2508.08228, 2025

    Sining Lu, Guan Chen, Nam Anh Dinh, Itai Lang, Ari Holtzman, and Rana Hanocka. LL3M: Large language 3D modelers.arXiv:2508.08228, 2025. 3, 6, 7

  7. [7]

    Holodeck: Language guided generation of 3d embodied AI environments

    Yue Yang, Fan-Yun Sun, Luca Weihs, Eli VanderBilt, Alvaro Herrasti, Winson Han, Jiajun Wu, Nick Haber, Ranjay Krishna, Lingjie Liu, Chris Callison-Burch, Mark Yatskar, Aniruddha Kembhavi, and Christopher Clark. Holodeck: Language guided generation of 3d embodied AI environments. InCVPR, 2024. 7

  8. [8]

    Vision-as-Inverse-Graphics Agent via Interleaved Multimodal Reasoning

    Shaofeng Yin, Jiaxin Ge, Zora Zhiruo Wang, Chenyang Wang, Xiuyu Li, Michael J. Black, Trevor Darrell, Angjoo Kanazawa, and Haiwen Feng. VIGA: Vision-as-inverse-graphics agent via interleaved multimodal reasoning.arXiv:2601.11109, 2026. 2, 3, 6, 7, 8, 9, 15, 20

  9. [9]

    BlenderAlchemy: Editing 3d graphics with vision-language models.arXiv:2404.17672, 2024

    Ian Huang, Guandao Yang, and Leonidas Guibas. BlenderAlchemy: Editing 3d graphics with vision-language models.arXiv:2404.17672, 2024. 3

  10. [10]

    Infinite photorealistic worlds using procedural generation

    Alexander Raistrick et al. Infinite photorealistic worlds using procedural generation. InCVPR,

  11. [11]

    ProcTHOR: Large-scale embodied AI using procedural generation

    Matt Deitke, Eli VanderBilt, Alvaro Herrasti, Luca Weihs, Kiana Ehsani, Jordi Salvador, Winson Han, Eric Kolve, Aniruddha Kembhavi, and Roozbeh Mottaghi. ProcTHOR: Large-scale embodied AI using procedural generation. InNeurIPS, 2022. 3

  12. [12]

    AirSim360: A panoramic simulation platform within drone view

    Xian Ge, Yuling Pan, Yuhang Zhang, Xiang Li, Weijun Zhang, Dizhe Zhang, Zhaoliang Wan, Xin Lin, Xiangkai Zhang, Juntao Liang, Jason Li, Wenjie Jiang, Bo Du, Ming-Hsuan Yang, and Lu Qi. AirSim360: A panoramic simulation platform within drone view. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026

  13. [13]

    Depth any panoramas: A foundation model for panoramic depth estimation

    Xin Lin, Meixi Song, Dizhe Zhang, Wenxuan Lu, Haodong Li, Bo Du, Ming-Hsuan Yang, Truong Nguyen, and Lu Qi. Depth any panoramas: A foundation model for panoramic depth estimation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026. 3

  14. [14]

    ATISS: Autoregressive transformers for indoor scene synthesis

    Despoina Paschalidou et al. ATISS: Autoregressive transformers for indoor scene synthesis. In NeurIPS, 2021. 3

  15. [15]

    Ctrl-Room: Controllable text-to-3d room meshes generation with layout constraints.arXiv:2310.03602, 2023

    Chuan Fang et al. Ctrl-Room: Controllable text-to-3d room meshes generation with layout constraints.arXiv:2310.03602, 2023

  16. [16]

    LayoutGPT: Compositional visual planning and generation with large language models

    Weixi Feng, Wanrong Zhu, Tsu-Jui Fu, Varun Jampani, Arjun Akula, Xuehai He, Sugato Basu, Xin Eric Wang, and William Yang Wang. LayoutGPT: Compositional visual planning and generation with large language models. InNeurIPS, 2023

  17. [17]

    InstructScene: Instruction-driven 3d indoor scene synthesis with semantic graph prior

    Chenguo Lin et al. InstructScene: Instruction-driven 3d indoor scene synthesis with semantic graph prior. InICLR, 2024

  18. [18]

    SceneTeller: Language-to-3d scene generation

    Ba¸ sak Melis Öcal et al. SceneTeller: Language-to-3d scene generation. InECCV, 2024. 10

  19. [19]

    DiffuScene: Denoising diffusion models for generative indoor scene synthesis

    Jiapeng Tang et al. DiffuScene: Denoising diffusion models for generative indoor scene synthesis. InCVPR, 2024. 3

  20. [20]

    Objaverse: A universe of annotated 3d objects

    Matt Deitke, Dustin Schwenk, Jordi Salvador, Luca Weihs, Oscar Michel, Eli VanderBilt, Ludwig Schmidt, Kiana Ehsani, Aniruddha Kembhavi, and Ali Farhadi. Objaverse: A universe of annotated 3d objects. InCVPR, 2023. 3

  21. [21]

    Objaverse-XL: A universe of 10M+ 3d objects

    Matt Deitke, Ruoshi Liu, Matthew Wallingford, Huong Ngo, et al. Objaverse-XL: A universe of 10M+ 3d objects. InNeurIPS, 2023. 3

  22. [22]

    BlenderGym: Bench- marking foundational model systems for graphics editing

    Yunqi Gu, Ian Huang, Jihyeon Je, Guandao Yang, and Leonidas Guibas. BlenderGym: Bench- marking foundational model systems for graphics editing. InCVPR, 2025. Highlight. 3, 7

  23. [23]

    Magic3D: High-resolution text-to-3d content creation

    Chen-Hsuan Lin, Jun Gao, Luming Tang, Towaki Takikawa, Xiaohui Zeng, Xun Huang, Karsten Kreis, Sanja Fidler, Ming-Yu Liu, and Tsung-Yi Lin. Magic3D: High-resolution text-to-3d content creation. InCVPR, 2023. 3

  24. [24]

    Fantasia3D: Disentangling geometry and appearance for high-quality text-to-3d content creation

    Rui Chen, Yongwei Chen, Ningxin Jiao, and Kui Jia. Fantasia3D: Disentangling geometry and appearance for high-quality text-to-3d content creation. InICCV, 2023

  25. [25]

    Pro- lificDreamer: High-fidelity and diverse text-to-3d generation with variational score distillation

    Zhengyi Wang, Cheng Lu, Yikai Wang, Fan Bao, Chongxuan Li, Hang Su, and Jun Zhu. Pro- lificDreamer: High-fidelity and diverse text-to-3d generation with variational score distillation. InNeurIPS, 2023. 3

  26. [26]

    InstantMesh: Efficient 3D Mesh Generation from a Single Image with Sparse-view Large Reconstruction Models

    Jiale Xu, Weihao Cheng, Yiming Gao, Xintao Wang, Shenghua Gao, and Ying Shan. In- stantMesh: Efficient 3d mesh generation from a single image with sparse-view large reconstruc- tion models.arXiv:2404.07191, 2024. 3

  27. [27]

    3d Gaussian Splatting for real-time radiance field rendering

    Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 3d Gaussian Splatting for real-time radiance field rendering. InSIGGRAPH, 2023

  28. [28]

    DreamGaussian: Generative Gaussian Splatting for efficient 3d content creation

    Jiaxiang Tang, Jiawei Ren, Hang Zhou, Ziwei Liu, and Gang Zeng. DreamGaussian: Generative Gaussian Splatting for efficient 3d content creation. InICLR, 2024

  29. [29]

    GaussianDreamer: Fast generation from text to 3d Gaussians by bridging 2d and 3d diffusion models

    Taoran Yi, Jiemin Fang, Junjie Wang, Guanjun Wu, Lingxi Xie, Xiaopeng Zhang, Wenyu Liu, Qi Tian, and Xinggang Wang. GaussianDreamer: Fast generation from text to 3d Gaussians by bridging 2d and 3d diffusion models. InCVPR, 2024

  30. [30]

    Text-to-3d using Gaussian Splatting

    Zilong Chen, Feng Wang, Yikai Wang, and Huaping Liu. Text-to-3d using Gaussian Splatting. InCVPR, 2024

  31. [31]

    GraphDreamer: Compositional 3d scene synthesis from scene graphs

    Gege Gao et al. GraphDreamer: Compositional 3d scene synthesis from scene graphs. InCVPR,

  32. [32]

    Zero-1-to-3: Zero-shot one image to 3d object

    Ruoshi Liu, Rundi Wu, Basile Van Hoorick, Pavel Tokmakov, Sergey Zakharov, and Carl V ondrick. Zero-1-to-3: Zero-shot one image to 3d object. InICCV, 2023. 3

  33. [33]

    MVDream: Multi-view diffusion for 3d generation

    Yichun Shi, Peng Wang, Jianglong Ye, Mai Long, Kejie Li, and Xiao Yang. MVDream: Multi-view diffusion for 3d generation. InICLR, 2024

  34. [34]

    Wonder3D: Single image to 3d using cross-domain diffusion

    Xiaoxiao Long, Yuan-Chen Guo, Cheng Lin, Yuan Liu, Zhiyang Dou, Lingjie Liu, Yuexin Ma, Song-Hai Zhang, Marc Habermann, Christian Theobalt, et al. Wonder3D: Single image to 3d using cross-domain diffusion. InCVPR, 2024. 3

  35. [35]

    Set-the-scene: Global-local training for generating controllable NeRF scenes

    Dana Cohen-Bar et al. Set-the-scene: Global-local training for generating controllable NeRF scenes. InICCV, 2023. 3

  36. [36]

    SceneWiz3D: Towards text-guided 3d scene composition

    Qihang Zhang et al. SceneWiz3D: Towards text-guided 3d scene composition. InCVPR, 2024

  37. [37]

    Text2Room: Extracting textured 3d meshes from 2d text-to-image models

    Lukas Höllein et al. Text2Room: Extracting textured 3d meshes from 2d text-to-image models. InICCV, 2023

  38. [38]

    SceneScape: Text-driven consistent scene generation

    Rafail Fridman et al. SceneScape: Text-driven consistent scene generation. InNeurIPS, 2023. 11

  39. [39]

    GALA3D: Towards text-to-3d complex scene generation via layout-guided generative Gaussian Splatting

    Xiaoyu Zhou et al. GALA3D: Towards text-to-3d complex scene generation via layout-guided generative Gaussian Splatting. InICML, 2024

  40. [40]

    Director3D: Real-world camera trajectory and 3d scene generation from text

    Xinyang Li et al. Director3D: Real-world camera trajectory and 3d scene generation from text. InNeurIPS, 2024. 3

  41. [41]

    Text-to-4d dynamic scene generation

    Uriel Singer, Shelly Sheynin, Adam Polyak, Oron Ashual, Iurii Makarov, Filippos Kokkinos, Naman Goyal, Andrea Vedaldi, Devi Parikh, Justin Johnson, and Yaniv Taigman. Text-to-4d dynamic scene generation. InICML, 2023. 3

  42. [42]

    4D-fy: Text-to-4d generation using hybrid score distillation sampling

    Sherwin Bahmani et al. 4D-fy: Text-to-4d generation using hybrid score distillation sampling. InCVPR, 2024

  43. [43]

    Dreamgaussian4d: Generative 4d gaussian splatting,

    Jiawei Ren et al. DreamGaussian4D: Generative 4d Gaussian Splatting.arXiv:2312.17142, 2024

  44. [44]

    Comp4D: LLM-guided compositional 4d scene generation.arXiv:2403.16993, 2024

    Dejia Xu et al. Comp4D: LLM-guided compositional 4d scene generation.arXiv:2403.16993, 2024

  45. [45]

    L4GM: Large 4d Gaussian reconstruction model

    Jiawei Ren et al. L4GM: Large 4d Gaussian reconstruction model. InNeurIPS, 2024

  46. [46]

    TC4D: Trajectory-conditioned text-to-4d generation

    Sherwin Bahmani et al. TC4D: Trajectory-conditioned text-to-4d generation. InECCV, 2024

  47. [47]

    Diffusion4D: Fast spatial-temporal consistent 4d generation via video diffusion models

    Hanwen Liang et al. Diffusion4D: Fast spatial-temporal consistent 4d generation via video diffusion models. InECCV, 2024

  48. [48]

    Sv4d: Dynamic 3d contentgenerationwithmulti-frameandmulti-viewconsistency.arXivpreprintarXiv:2407.17470, 2024

    Yiming Xie et al. SV4D: Dynamic 3d content generation with multi-frame and multi-view consistency.arXiv:2407.17470, 2024

  49. [49]

    Xiaoyuan Wang, Yizhou Zhao, Botao Ye, Xiaojun Shan, Weijie Lyu, Lu Qi, Kelvin C. K. Chan, Yinxiao Li, and Ming-Hsuan Yang. HoliGS: Holistic gaussian splatting for embodied view synthesis. InAdvances in Neural Information Processing Systems (NeurIPS), 2025. 3

  50. [50]

    High-resolution image synthesis with latent diffusion models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. InCVPR, 2022. 3

  51. [51]

    Video diffusion models

    Jonathan Ho, Tim Salimans, Alexey Gritsenko, William Chan, Mohammad Norouzi, and David J Fleet. Video diffusion models. InNeurIPS, 2022

  52. [52]

    Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

    Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Do- minik Lorenz, Yam Levi, Zion English, Vikram V oleti, Adam Letts, Varun Jampani, and Robin Rombach. Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv:2311.15127, 2023

  53. [53]

    Make-a-video: Text-to-video generation without text-video data

    Uriel Singer, Adam Polyak, Thomas Hayes, Xi Yin, Jie An, Songyang Zhang, Qiyuan Hu, Harry Yang, Oron Ashual, Oran Gafni, et al. Make-a-video: Text-to-video generation without text-video data. InICLR, 2023

  54. [54]

    Video generation models as world simulators

    Tim Brooks, Bill Peebles, Connor Holmes, et al. Video generation models as world simulators

  55. [55]

    OpenAI Technical Report

  56. [56]

    OmniRoam: World wandering via long-horizon panoramic video generation

    Yuheng Liu, Xin Lin, Xinke Li, Baihan Yang, Chen Wang, Kalyan Sunkavalli, Yannick Hold- Geoffroy, Hao Tan, Kai Zhang, Xiaohui Xie, Zifan Shi, and Yiwei Hu. OmniRoam: World wandering via long-horizon panoramic video generation. InACM SIGGRAPH, 2026

  57. [57]

    PanoWorld: Towards Spatial Supersensing in 360$^\circ$ Panorama World

    Changpeng Wang, Xin Lin, Junhan Liu, Yuheng Liu, Zhen Wang, Donglian Qi, Yunfeng Yan, and Xi Chen. PanoWorld: Towards spatial supersensing in 360◦ panorama world.arXiv preprint arXiv:2605.13169, 2026. 3

  58. [58]

    PhysDreamer: Physics-based interaction with 3d objects via video generation.arXiv:2404.13026, 2024

    Tianyuan Zhang et al. PhysDreamer: Physics-based interaction with 3d objects via video generation.arXiv:2404.13026, 2024. 3

  59. [59]

    PhysGaussian: Physics-integrated 3d Gaussians for generative dynamics

    Tianyi Xie et al. PhysGaussian: Physics-integrated 3d Gaussians for generative dynamics. In CVPR, 2024. 3 12

  60. [60]

    Jeni, Min Xu, and Yizhou Zhao

    Chunjiang Liu, Xiaoyuan Wang, Qingran Lin, Albert Xiao, Haoyu Chen, Shizheng Wen, Hao Zhang, Lu Qi, Ming-Hsuan Yang, László A. Jeni, Min Xu, and Yizhou Zhao. MOSIV: Multi-object system identification from videos. InInternational Conference on Learning Representations (ICLR), 2026. 3

  61. [61]

    MASIV: Toward material-agnostic system identification from videos

    Yizhou Zhao, Haoyu Chen, Chunjiang Liu, Zhenyang Li, Charles Herrmann, Junhwa Hur, Yinxiao Li, Ming-Hsuan Yang, Bhiksha Raj, and Min Xu. MASIV: Toward material-agnostic system identification from videos. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2025. 3

  62. [62]

    Kubric: A scalable dataset generator

    Klaus Greff, Francois Belletti, Lucas Beyer, Carl Doersch, Yilun Du, Daniel Duckworth, David J Fleet, Dan Gnanapragasam, Florian Golemo, Charles Herrmann, et al. Kubric: A scalable dataset generator. InCVPR, 2022. 3

  63. [63]

    ReAct: Synergizing reasoning and acting in language models

    Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. ReAct: Synergizing reasoning and acting in language models. InICLR, 2023. 3

  64. [64]

    Reflexion: Language agents with verbal reinforcement learning

    Noah Shinn, Federico Cassano, Edward Berman, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning. InNeurIPS, 2023. 3

  65. [65]

    Self-refine: Iterative refinement with self-feedback

    Aman Madaan et al. Self-refine: Iterative refinement with self-feedback. InNeurIPS, 2023. 3

  66. [66]

    Large language models cannot self-correct reasoning yet

    Jie Huang et al. Large language models cannot self-correct reasoning yet. InICLR, 2024. 3

  67. [67]

    V oyager: An open-ended embodied agent with large language models

    Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. V oyager: An open-ended embodied agent with large language models. InNeurIPS, 2023. 3

  68. [68]

    Toolformer: Language models can teach themselves to use tools

    Timo Schick et al. Toolformer: Language models can teach themselves to use tools. InNeurIPS,

  69. [69]

    CRITIC: Large language models can self-correct with tool-interactive critiquing

    Zhibin Gou et al. CRITIC: Large language models can self-correct with tool-interactive critiquing. InICLR, 2024. 3

  70. [70]

    Executable code actions elicit better LLM agents

    Xingyao Wang et al. Executable code actions elicit better LLM agents. InICML, 2024. 3

  71. [71]

    AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation

    Qingyun Wu et al. AutoGen: Enabling next-gen LLM applications via multi-agent conversation. arXiv:2308.08155, 2023. 3

  72. [72]

    MetaGPT: Meta programming for a multi-agent collaborative framework

    Sirui Hong et al. MetaGPT: Meta programming for a multi-agent collaborative framework. In ICLR, 2024. 3

  73. [73]

    ChatDev: Communicative agents for software development

    Chen Qian et al. ChatDev: Communicative agents for software development. InACL, 2024. 3

  74. [74]

    Effective context engineering for AI agents, 2025

    Anthropic. Effective context engineering for AI agents, 2025. URLhttps://www.anthropic. com/engineering/effective-context-engineering-for-ai-agents. 3

  75. [75]

    A Survey of Context Engineering for Large Language Models

    Lingrui Mei, Jiayu Yao, et al. A survey of context engineering for large language models. arXiv:2507.13334, 2025. 3

  76. [76]

    Writing tools for agents, 2025

    Anthropic. Writing tools for agents, 2025. URL https://www.anthropic.com/ engineering/writing-tools-for-agents. 3

  77. [77]

    LLM powered autonomous agents, 2023

    Lilian Weng. LLM powered autonomous agents, 2023. URL https://lilianweng.github. io/posts/2023-06-23-agent/. lilianweng.github.io. 3

  78. [78]

    Generative agents: Interactive simulacra of human behavior

    Joon Sung Park, Joseph C O’Brien, Carrie J Cai, Meredith Ringel Morris, Percy Liang, and Michael S Bernstein. Generative agents: Interactive simulacra of human behavior. InUIST,

  79. [79]

    Xing, Hao Zhang, Joseph E

    Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging LLM-as-a-judge with MT-Bench and chatbot arena. InAdvances in Neural Information Processing Systems (NeurIPS), Datasets and Benchmarks Track, 2023. 7 13

  80. [80]

    Yushi Hu, Benlin Liu, Jungo Kasai, Yizhong Wang, Mari Ostendorf, Ranjay Krishna, and Noah A. Smith. TIFA: Accurate and interpretable text-to-image faithfulness evaluation with question answering. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023

Showing first 80 references.