pith. machine review for the scientific record.

arxiv: 2605.12449 · v1 · submitted 2026-05-12 · 💻 cs.CV

Recognition: 2 theorem links · Lean Theorem

LychSim: A Controllable and Interactive Simulation Framework for Vision Research

Alan Yuille, Chloe Wang, Jiawei Peng, Patrick Li, Siyi Chen, Wufei Ma

Pith reviewed 2026-05-13 06:21 UTC · model grok-4.3

classification 💻 cs.CV
keywords simulation framework · vision research · synthetic data · out-of-distribution evaluation · procedural generation · Python API · LLM integration · closed-loop control

The pith

LychSim provides a Python API and procedural pipeline that makes high-fidelity simulation controllable for vision research and closed-loop testing.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents LychSim as a simulation framework on Unreal Engine 5 that lowers technical barriers for vision researchers who need controllable synthetic data and rigorous out-of-distribution evaluation. It centers on a streamlined Python interface that hides engine details, a procedural generator that creates varied environments with OOD visual challenges together with rich 2D and 3D ground truth, and native integration of the Model Context Protocol to turn the simulator into an interactive environment for reasoning language models. The authors annotate scene rules and object alignments to support automated modifications and demonstrate use cases in synthetic data production, reinforcement-learning adversarial examiners, and language-driven layout changes. A sympathetic reader cares because simulation remains essential for closed-loop optimization and OOD testing even after advances in self-supervised pretraining.

Core claim

LychSim is built around three designs: a streamlined Python API that abstracts underlying engine complexities, a procedural data pipeline that generates diverse high-fidelity environments with varying OOD visual challenges paired with rich 2D and 3D ground truths, and native Model Context Protocol integration that transforms the simulator into a dynamic closed-loop playground for reasoning agentic LLMs. Scene-level procedural rules and object-level pose alignments are annotated to enable semantically aligned 3D ground truths and automated scene modification.
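The page quotes MCP tool names such as `spawn_object`, `list_objects`, `set_camera_location`, `set_camera_rotation`, and `get_camera_lit` in its appendix excerpts but does not reproduce the Python API itself. As a rough illustration of what an engine-hiding interface of this kind could look like, the sketch below stubs out a hypothetical `SimClient`; the class, its methods, and the asset path are invented for illustration and should not be read as the released LychSim API.

```python
# Illustrative sketch only: SimClient and its methods are assumptions, not the released LychSim API.
from dataclasses import dataclass


@dataclass
class Pose:
    # Centimeters, Z-up, yaw=0 facing +X (the convention quoted in the appendix excerpts).
    x: float
    y: float
    z: float
    yaw: float = 0.0


class SimClient:
    """Stand-in for a streamlined engine-control client that hides Unreal Engine details."""

    def __init__(self) -> None:
        self._objects: dict[str, Pose] = {}  # a real client would hold an engine connection instead

    def spawn_object(self, asset: str, pose: Pose,
                     collision_handling: str = "adjust_if_possible") -> str:
        """Place `asset` at `pose` and return an object id (the engine call is omitted here)."""
        obj_id = f"{asset.rsplit('/', 1)[-1]}_{len(self._objects)}"
        self._objects[obj_id] = pose
        return obj_id

    def set_camera(self, location: Pose, pitch: float = 0.0) -> None:
        """Move the viewpoint; stands in for the quoted set_camera_location / set_camera_rotation."""
        self._camera = (location, pitch)

    def render(self, path: str) -> None:
        """Save a lit RGB frame; stands in for the quoted get_camera_lit."""
        print(f"would write render to {path}")

    def ground_truth(self, modalities: list[str]) -> dict[str, str]:
        """Request per-frame labels such as depth or instance segmentation (names illustrative)."""
        return {m: f"{m}.npz" for m in modalities}


# Closed-loop-style usage: spawn, frame, render, fetch labels.
sim = SimClient()
desk = sim.spawn_object("/Assets/Furniture/Desk", Pose(0, 0, -20))
sim.set_camera(Pose(300, 0, 150, yaw=180), pitch=-15)
sim.render("renders/office.png")
labels = sim.ground_truth(["depth", "instance_segmentation"])
```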

What carries the argument

LychSim's three core designs: the streamlined Python API for control, the procedural data pipeline for environment and ground-truth generation, and the native MCP integration for interactive LLM-driven scene control.

Load-bearing premise

That the Python API and procedural pipeline are fully implemented and produce sufficiently diverse, high-fidelity OOD scenes, and that the promised public release includes complete, usable code with accurate annotations.

What would settle it

Public release of the code where users without game-development expertise successfully generate varied OOD datasets, obtain matching 2D/3D ground truth, and run closed-loop experiments controlled by reasoning language models.

read the original abstract

While self-supervised pretraining has reduced vision systems' reliance on synthetic data, simulation remains an indispensable tool for closed-loop optimization and rigorous out-of-distribution (OOD) evaluation. However, modern simulation platforms often present steep technical barriers, requiring extensive expertise in computer graphics and game development. In this work, we present LychSim, a highly controllable and interactive simulation framework built upon Unreal Engine 5 to bridge this gap. LychSim is built around three key designs: (1) a streamlined Python API that abstracts away underlying engine complexities; (2) a procedural data pipeline capable of generating diverse, high-fidelity environments with varying out-of-distribution (OOD) visual challenges, paired with rich 2D and 3D ground truths; and (3) a native integration of the Model Context Protocol (MCP) that transforms the simulator into a dynamic, closed-loop playground for reasoning agentic LLMs. We further annotate scene-level procedural rules and object-level pose alignments to enable semantically aligned 3D ground truths and automated scene modification. We demonstrate LychSim's capability across multiple downstream applications, including serving as a synthetic data engine, powering reinforcement learning-based adversarial examiners, and facilitating interactive, language-driven scene layout generation. To benefit the broader vision community, LychSim will be made publicly available, including full source code and various data annotations.
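For readers unfamiliar with MCP, the sketch below shows one way a single simulator action could be exposed as an MCP tool, assuming the `FastMCP` helper from the official MCP Python SDK. The tool name `spawn_object` echoes the schema quoted in the appendix excerpts below; its signature and placeholder body are illustrative, not the paper's implementation.

```python
# Sketch under assumptions: FastMCP comes from the official MCP Python SDK;
# the tool body is a placeholder, not the LychSim engine bridge.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("lychsim")  # server name shown to a connected LLM client


@mcp.tool()
def spawn_object(asset_path: str, x: float, y: float, z: float,
                 yaw: float = 0.0, collision_handling: str = "adjust_if_possible") -> str:
    """Spawn an asset at (x, y, z) centimeters with the given yaw and report the result."""
    # A real server would forward this request to the simulator's Python API.
    return f"spawned {asset_path} at ({x}, {y}, {z}), yaw={yaw}, collision={collision_handling}"


if __name__ == "__main__":
    mcp.run()  # serve over stdio so a reasoning LLM can call the tool in a closed loop
```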

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces LychSim, a controllable simulation framework built on Unreal Engine 5 for vision research. It centers on three designs: a streamlined Python API abstracting engine complexities, a procedural pipeline generating diverse high-fidelity OOD scenes with rich 2D/3D ground truths and scene-level annotations, and native MCP integration enabling closed-loop interaction with reasoning LLM agents. The work describes applications as a synthetic data engine, RL-based adversarial examiners, and language-driven scene layout, with plans for public release of code and annotations.

Significance. If the implementation matches the description and the code is released in usable form, LychSim could meaningfully reduce technical barriers for vision researchers seeking high-fidelity simulation for OOD evaluation and agentic closed-loop tasks, complementing existing platforms by emphasizing accessibility and LLM integration.

major comments (3)
  1. [Abstract] Abstract and demonstrations section: the central claims that the Python API, procedural pipeline, and MCP integration are fully functional and produce diverse high-fidelity OOD challenges rest entirely on descriptive assertions; no quantitative metrics (e.g., scene diversity scores, rendering latency, ground-truth alignment error, or comparison against baselines such as Habitat or AI2-THOR) are provided to support these assertions.
  2. [Demonstrations] Demonstrations section: the reported uses in synthetic data generation, RL adversarial examiners, and language-driven layout are presented without example outputs, success rates, ablation studies, or any empirical validation, leaving the bridging-the-gap claim unsupported by evidence.
  3. [Future release] Future release paragraph: the utility narrative depends on the promised public release of complete source code, annotations, and MCP integration; no current repository link, code snippets, or implementation details are supplied to allow verification of the described features.
minor comments (2)
  1. [Abstract] Abstract: the phrase 'various data annotations' is vague; specify the exact annotation types (e.g., semantic segmentation, depth, pose) and their formats.
  2. [Related work] Consider adding a table comparing LychSim's feature set (API, MCP support, procedural rules) against at least two existing simulators to clarify its positioning.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive review and for identifying areas where additional empirical support would strengthen the manuscript. We address each major comment below and describe the revisions we will make.

read point-by-point responses
  1. Referee: [Abstract] Abstract and demonstrations section: the central claims that the Python API, procedural pipeline, and MCP integration are fully functional and produce diverse high-fidelity OOD challenges rest entirely on descriptive assertions; no quantitative metrics (e.g., scene diversity scores, rendering latency, ground-truth alignment error, or comparison against baselines such as Habitat or AI2-THOR) are provided to support these assertions.

    Authors: We agree that quantitative evidence would better substantiate the claims. In the revised manuscript we will add a dedicated evaluation subsection reporting scene diversity (via procedural parameter entropy and visual variation metrics), average rendering latency for scenes of varying complexity, and ground-truth alignment accuracy (measured against manually verified annotations). We will also include a feature-comparison table against Habitat and AI2-THOR emphasizing API accessibility and OOD generation. A minimal sketch of one such diversity metric appears after this list. revision: yes

  2. Referee: [Demonstrations] Demonstrations section: the reported uses in synthetic data generation, RL adversarial examiners, and language-driven layout are presented without example outputs, success rates, ablation studies, or any empirical validation, leaving the bridging-the-gap claim unsupported by evidence.

    Authors: The current demonstrations are primarily illustrative. We will expand this section with concrete example outputs (sample generated scenes and annotations), quantitative success rates for the RL adversarial examiners (e.g., attack success on standard vision models), and preliminary results for language-driven layout tasks. While exhaustive ablations exceed the scope of a framework paper, we will include initial empirical numbers to support the utility claims. revision: partial

  3. Referee: [Future release] Future release paragraph: the utility narrative depends on the promised public release of complete source code, annotations, and MCP integration; no current repository link, code snippets, or implementation details are supplied to allow verification of the described features.

    Authors: We acknowledge the need for immediate verifiability. The revised manuscript will include code snippets demonstrating the Python API and MCP integration in an appendix. We commit to releasing the full source code, annotations, and MCP integration on a public repository upon acceptance and will provide reviewers with a private link during the revision cycle if the venue permits. revision: partial
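To make the diversity metric mentioned in response 1 concrete, the sketch below shows one way a procedural-parameter entropy score could be computed; it is not taken from the paper, and the parameter semantics are hypothetical.

```python
# Minimal sketch (an assumption, not the authors' metric): mean Shannon entropy of
# discretized procedural parameters as a crude scene-diversity score.
import numpy as np


def parameter_entropy(samples: np.ndarray, bins: int = 16) -> float:
    """Mean per-dimension entropy (bits) of sampled procedural parameters.

    samples: shape (num_scenes, num_parameters), e.g. lighting intensity, fog density,
    object count drawn by the generator (parameter names hypothetical).
    """
    entropies = []
    for column in samples.T:
        counts, _ = np.histogram(column, bins=bins)
        p = counts / counts.sum()
        p = p[p > 0]
        entropies.append(float(-(p * np.log2(p)).sum()))
    return float(np.mean(entropies))


# Toy usage: 1000 scenes with 3 uniformly sampled parameters approach log2(16) = 4 bits.
rng = np.random.default_rng(0)
print(parameter_entropy(rng.uniform(size=(1000, 3))))
```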

Circularity Check

0 steps flagged

No circularity in derivation chain

full rationale

The paper is a software framework description with no mathematical derivations, equations, predictions, fitted parameters, or first-principles results. Its claims concern the existence and utility of a Python API, procedural pipeline, and MCP integration, none of which are presented as derived from prior quantities by construction. No load-bearing steps reduce to self-definition, fitted inputs renamed as predictions, or self-citation chains; the work is self-contained as an engineering artifact whose validity depends on implementation completeness rather than any circular logical reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a software engineering contribution with no free parameters, axioms, or invented scientific entities.

pith-pipeline@v0.9.0 · 5553 in / 1022 out tokens · 45213 ms · 2026-05-13T06:21:53.733249+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Reference graph

Works this paper leans on

79 extracted references · 79 canonical work pages · 4 internal anchors

  1. [1]

    Claude Opus 4.6, 2026

    Anthropic. Claude Opus 4.6, 2026. URL https://www.anthropic.com/news/claude-opus-4-6. Accessed: 2026-04-11

  2. [2]

    Sims-v: Simulated instruction-tuning for spatial video understanding. arXiv preprint arXiv:2511.04668, 2025

    E. Brown, A. Ray, R. Krishna, R. Girshick, R. Fergus, and S. Xie. Sims-v: Simulated instruction-tuning for spatial video understanding. arXiv preprint arXiv:2511.04668, 2025

  3. [3]

    Emerging properties in self-supervised vision transformers

    M. Caron, H. Touvron, I. Misra, H. Jégou, J. Mairal, P. Bojanowski, and A. Joulin. Emerging properties in self-supervised vision transformers. In Proceedings of the IEEE/CVF international conference on computer vision, pages 9650–9660, 2021

  4. [4]

    I-design: Personalized llm interior designer

    A. Çelen, G. Han, K. Schindler, L. Van Gool, I. Armeni, A. Obukhov, and X. Wang. I-design: Personalized llm interior designer. In European Conference on Computer Vision, pages 217–234. Springer, 2024

  5. [5]

    B. Chen, Z. Xu, S. Kirmani, B. Ichter, D. Sadigh, L. Guibas, and F. Xia. Spatialvlm: Endowing vision-language models with spatial reasoning capabilities. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14455–14465, 2024

  6. [6]

    Spatialrgpt: Grounded spatial reasoning in vision-language models

    A.-C. Cheng, H. Yin, Y. Fu, Q. Guo, R. Yang, J. Kautz, X. Wang, and S. Liu. Spatialrgpt: Grounded spatial reasoning in vision-language models. Advances in Neural Information Processing Systems, 37:135062–135093, 2024

  7. [7]

    Pybullet, a python module for physics simulation for games, robotics and machine learning

    E. Coumans and Y. Bai. Pybullet, a python module for physics simulation for games, robotics and machine learning. http://pybullet.org, 2016–2021

  8. [8]

    ProcTHOR: Large-Scale Embodied AI Using Procedural Generation

    M. Deitke, E. VanderBilt, A. Herrasti, L. Weihs, J. Salvador, K. Ehsani, W. Han, E. Kolve, A. Farhadi, A. Kembhavi, and R. Mottaghi. ProcTHOR: Large-Scale Embodied AI Using Procedural Generation. In NeurIPS, 2022. Outstanding Paper Award

  9. [9]

    J. Deng, W. Chai, J. Huang, Z. Zhao, Q. Huang, M. Gao, J. Guo, S. Hao, W. Hu, J.-N. Hwang, et al. Citycraft: A real crafter for 3d city generation. arXiv preprint arXiv:2406.04983, 2024

  10. [10]

    Carla: An open urban driving simulator

    A. Dosovitskiy, G. Ros, F. Codevilla, A. Lopez, and V. Koltun. Carla: An open urban driving simulator. In Conference on robot learning, pages 1–16. PMLR, 2017

  11. [11]

    Fab asset marketplace, 2026

    Epic Games. Fab asset marketplace, 2026. URL https://www.fab.com/. Unified marketplace for digital assets

  12. [12]

    W. Feng, W. Zhu, T.-j. Fu, V. Jampani, A. Akula, X. He, S. Basu, X. E. Wang, and W. Y. Wang. Layoutgpt: Compositional visual planning and generation with large language models. Advances in Neural Information Processing Systems, 36:18225–18250, 2023

  13. [13]

    Virtual worlds as proxy for multi-object tracking analysis

    A. Gaidon, Q. Wang, Y. Cabon, and E. Vig. Virtual worlds as proxy for multi-object tracking analysis. In CVPR, 2016

  14. [14]

    Cater: A diagnostic dataset for compositional actions and temporal reasoning

    R. Girdhar and D. Ramanan. Cater: A diagnostic dataset for compositional actions and temporal reasoning. arXiv preprint arXiv:1910.04744, 2019

  15. [15]

    Gemma-4-31b, 2026

    Google DeepMind. Gemma-4-31b, 2026. URL https://huggingface.co/google/gemma-4-31B. Hugging Face Model Card

  16. [16]

    K. He, X. Chen, S. Xie, Y. Li, P. Dollár, and R. Girshick. Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 16000–16009, 2022

  17. [17]

    Z. Hu, A. Iscen, A. Jain, T. Kipf, Y. Yue, D. A. Ross, C. Schmid, and A. Fathi. Scenecraft: An llm agent for synthesizing 3d scenes as blender code. In Forty-first International Conference on Machine Learning, 2024

  18. [18]

    Clevr: A diagnostic dataset for compositional language and elementary visual reasoning

    J. Johnson, B. Hariharan, L. Van Der Maaten, L. Fei-Fei, C. Lawrence Zitnick, and R. Girshick. Clevr: A diagnostic dataset for compositional language and elementary visual reasoning. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2901–2910, 2017

  19. [19]

    Habitat synthetic scenes dataset (hssd-200): An analysis of 3d scene scale and realism tradeoffs for objectgoal navigation

    M. Khanna, Y. Mao, H. Jiang, S. Haresh, B. Shacklett, D. Batra, A. Clegg, E. Undersander, A. X. Chang, and M. Savva. Habitat synthetic scenes dataset (hssd-200): An analysis of 3d scene scale and realism tradeoffs for objectgoal navigation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16384–16393, 2024

  20. [20]

    T. S. Kim, M. Peven, W. Qiu, A. Yuille, and G. D. Hager. Synthesizing attributes with unreal engine for fine-grained activity analysis. In 2019 IEEE Winter Applications of Computer Vision Workshops (WACVW), pages 35–37. IEEE, 2019

  21. [21]

    Segment anything

    A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W.-Y. Lo, et al. Segment anything. In Proceedings of the IEEE/CVF international conference on computer vision, pages 4015–4026, 2023

  22. [22]

    AI2-THOR: An Interactive 3D Environment for Visual AI

    E. Kolve, R. Mottaghi, W. Han, E. VanderBilt, L. Weihs, A. Herrasti, D. Gordon, Y. Zhu, A. Gupta, and A. Farhadi. AI2-THOR: An Interactive 3D Environment for Visual AI. arXiv, 2017

  23. [23]

    Re-thinking inverse graphics with large language models

    P. Kulits, H. Feng, W. Liu, V. F. Abrevaya, and M. J. Black. Re-thinking inverse graphics with large language models. Transactions on Machine Learning Research, 2024. ISSN 2835-8856. URL https://openreview.net/forum?id=u0eiu1MTS7

  24. [24]

    J. Lee, X. Wang, J. Peng, L. Ye, Z. Zheng, T. Zhang, T. Wang, W. Ma, S. Chen, Y.-C. Chou, et al. Perceptual taxonomy: Evaluating and guiding hierarchical scene reasoning in vision-language models. arXiv preprint arXiv:2511.19526, 2025

  25. [25]

    Grounding image matching in 3d with mast3r

    V. Leroy, Y. Cabon, and J. Revaud. Grounding image matching in 3d with mast3r. In European conference on computer vision, pages 71–91. Springer, 2024

  26. [26]

    Z. Li, X. Wang, E. Stengel-Eskin, A. Kortylewski, W. Ma, B. Van Durme, and A. L. Yuille. Super-clevr: A virtual benchmark to diagnose domain robustness in visual reasoning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 14963–14973, 2023

  27. [27]

    Q. Liu, A. Kortylewski, and A. L. Yuille. Poseexaminer: Automated testing of out-of-distribution robustness in human pose and shape estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 672–681, 2023

  28. [28]

    W. Ma, Q. Liu, J. Wang, A. Wang, X. Yuan, Y. Zhang, Z. Xiao, G. Zhang, B. Lu, R. Duan, Y. Qi, A. Kortylewski, Y. Liu, and A. Yuille. Generating images with 3d annotations using diffusion models. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=XlkN11Xj6J

  29. [29]

    W. Ma, G. Zhang, Q. Liu, G. Zeng, A. Kortylewski, Y. Liu, and A. Yuille. Imagenet3d: Towards general-purpose object-level 3d understanding. Advances in Neural Information Processing Systems, 37:96127–96149, 2024

  30. [30]

    W. Ma, H. Chen, G. Zhang, Y.-C. Chou, J. Chen, C. de Melo, and A. Yuille. 3dsrbench: A comprehensive 3d spatial reasoning benchmark. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6924–6934, 2025

  31. [31]

    W. Ma, L. Ye, C. M. de Melo, A. Yuille, and J. Chen. Spatialllm: A compound 3d-informed design towards spatially-intelligent large multimodal models. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 17249–17260, 2025

  32. [32]

    W. Ma, S. Cen, J. Shen, R. Lee, L. Begiristain, Y. Zhuang, J. Peng, Z. Yu, T. Song, X. Qi, T. Shu, A. Kortylewski, and A. Yuille. Unrealspace: Analyzing spatial understanding and reasoning in controllable simulation. In Findings of the Computer Vision and Pattern Recognition Conference, 2026

  33. [33]

    Spatialreasoner: Towards explicit and generalizable 3d spatial reasoning

    W. Ma, Y.-C. Chou, Q. Liu, X. Wang, C. de Melo, J. Xie, and A. Yuille. Spatialreasoner: Towards explicit and generalizable 3d spatial reasoning. Advances in Neural Information Processing Systems, 38:140751–140774, 2026

  34. [34]

    C. Ning, J. Peng, J. Wang, Y. Sun, Y. Liu, A. Yuille, A. Kortylewski, and A. Wang. Part321: Recognizing 3d object parts from a 2d image using 1-shot annotations, 2024. URL https://openreview.net/forum?id=jdFoxDnBwY

  35. [35]

    DINOv2: Learning Robust Visual Features without Supervision

    M. Oquab, T. Darcet, T. Moutakanni, H. Vo, M. Szafraniec, V. Khalidov, P. Fernandez, D. Haziza, F. Massa, A. El-Nouby, et al. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023

  36. [36]

    X. Puig, E. Undersander, A. Szot, M. D. Cote, R. Partsey, J. Yang, R. Desai, A. W. Clegg, M. Hlavac, T. Min, T. Gervet, V. Vondrus, V.-P. Berges, J. Turner, O. Maksymets, Z. Kira, M. Kalakrishnan, J. Malik, D. S. Chaplot, U. Jain, D. Batra, A. Rai, and R. Mottaghi. Habitat 3.0: A co-habitat for humans, avatars and robots, 2023

  37. [38]

    W. Qiu, F. Zhong, Y. Zhang, S. Qiao, Z. Xiao, T. S. Kim, and Y. Wang. Unrealcv: Virtual worlds for computer vision. In Proceedings of the 25th ACM international conference on Multimedia, pages 1221–1224, 2017

  38. [39]

    Learning transferable visual models from natural language supervision

    A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021

  39. [40]

    Infinite photorealistic worlds using procedural generation

    A. Raistrick, L. Lipson, Z. Ma, L. Mei, M. Wang, Y. Zuo, K. Kayan, H. Wen, B. Han, Y. Wang, A. Newell, H. Law, A. Goyal, K. Yang, and J. Deng. Infinite photorealistic worlds using procedural generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12630–12641, 2023

  40. [41]

    Infinigen indoors: Photorealistic indoor scenes using procedural generation

    A. Raistrick, L. Mei, K. Kayan, D. Yan, Y. Zuo, B. Han, H. Wen, M. Parakh, S. Alexandropoulos, L. Lipson, Z. Ma, and J. Deng. Infinigen indoors: Photorealistic indoor scenes using procedural generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 21783–21794, June 2024

  41. [42]

    A. Ray, J. Duan, E. Brown, R. Tan, D. Bashkirova, R. Hendrix, K. Ehsani, A. Kembhavi, B. A. Plummer, R. Krishna, et al. Sat: Dynamic spatial aptitude training for multimodal language models. arXiv preprint arXiv:2412.07755, 2024

  42. [43]

    Hypersim: A photorealistic synthetic dataset for holistic indoor scene understanding

    M. Roberts, J. Ramapuram, A. Ranjan, A. Kumar, M. A. Bautista, N. Paczan, R. Webb, and J. M. Susskind. Hypersim: A photorealistic synthetic dataset for holistic indoor scene understanding. In Proceedings of the IEEE/CVF international conference on computer vision, pages 10912–10922, 2021

  43. [44]

    G. Ros, L. Sellart, J. Materzynska, D. Vazquez, and A. M. Lopez. The synthia dataset: A large collection of synthetic images for semantic segmentation of urban scenes. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3234–3243, 2016

  44. [45]

    N. Ruiz, A. Kortylewski, W. Qiu, C. Xie, S. A. Bargal, A. Yuille, and S. Sclaroff. Simulated adversarial testing of face recognition models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 4145–4155, June 2022

  45. [46]

    M. Shu, C. Liu, W. Qiu, and A. Yuille. Identifying model weakness with adversarial examiner. In Proceedings of the AAAI conference on artificial intelligence, volume 34, pages 11998–12006, 2020

  46. [47]

    DINOv3

    O. Siméoni, H. V. Vo, M. Seitzer, F. Baldassarre, M. Oquab, C. Jose, V. Khalidov, M. Szafraniec, S. Yi, M. Ramamonjisoa, et al. Dinov3. arXiv preprint arXiv:2508.10104, 2025

  47. [48]

    H. Slim, X. Li, Y. Li, M. Ahmed, M. Ayman, U. Upadhyay, A. Abdelreheem, A. Prajapati, S. Pothigara, P. Wonka, et al. 3dcompat++: An improved large-scale 3d vision dataset for compositional recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025

  48. [49]

    F.-Y. Sun, W. Liu, S. Gu, D. Lim, G. Bhat, F. Tombari, M. Li, N. Haber, and J. Wu. Layoutvlm: Differentiable optimization of 3d layout via vision-language models. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 29469–29478, 2025

  49. [50]

    Mujoco: A physics engine for model-based control

    E. Todorov, T. Erez, and Y. Tassa. Mujoco: A physics engine for model-based control. In 2012 IEEE/RSJ international conference on intelligent robots and systems, pages 5026–5033. IEEE, 2012

  50. [51]

    F. Tosi, Y. Liao, C. Schmitt, and A. Geiger. Smd-nets: Stereo mixture density networks. In Conference on Computer Vision and Pattern Recognition (CVPR), 2021

  51. [52]

    H. Wang, Q. Xue, and W. Gao. Infinibench: Infinite benchmarking for visual spatial reasoning with customizable scene complexity. arXiv preprint arXiv:2511.18200, 2025

  52. [53]

    J. Wang, M. Chen, N. Karaev, A. Vedaldi, C. Rupprecht, and D. Novotny. Vggt: Visual geometry grounded transformer. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 5294–5306, 2025

  53. [54]

    S. Wang, V. Leroy, Y. Cabon, B. Chidlovskii, and J. Revaud. Dust3r: Geometric 3d vision made easy. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 20697–20709, 2024

  54. [55]

    X. Wang, W. Ma, Z. Li, A. Kortylewski, and A. L. Yuille. 3d-aware visual question answering about parts, poses and occlusions. Advances in Neural Information Processing Systems, 36:58717–58735, 2023

  55. [56]

    X. Wang, W. Ma, T. Zhang, C. M. de Melo, J. Chen, and A. Yuille. Spatial457: A diagnostic benchmark for 6d spatial reasoning of large multimodal models. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 24669–24679, 2025

  56. [57]

    Holodeck: Language guided generation of 3d embodied ai environments

    Y. Yang, F.-Y. Sun, L. Weihs, E. VanderBilt, A. Herrasti, W. Han, J. Wu, N. Haber, R. Krishna, L. Liu, et al. Holodeck: Language guided generation of 3d embodied ai environments. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16227–16237, 2024

  57. [58]

    X. Ye, J. Ren, Y. Zhuang, X. He, Y. Liang, Y. Yang, M. Dogra, X. Zhong, E. Liu, K. Benavente, et al. Simworld: An open-ended simulator for agents in physical and social worlds. In The Thirty-ninth Annual Conference on Neural Information Processing Systems

  58. [59]

    K. Yi, C. Gan, Y. Li, P. Kohli, J. Wu, A. Torralba, and J. B. Tenenbaum. Clevrer: Collision events for video representation and reasoning. arXiv preprint arXiv:1910.01442, 2019

  59. [60]

    S. Yin, J. Ge, Z. Z. Wang, X. Li, M. J. Black, T. Darrell, A. Kanazawa, and H. Feng. Vision-as-inverse-graphics agent via interleaved multimodal reasoning, 2026. URL https://arxiv.org/abs/2601.11109

  60. [61]

    Spatial understanding from videos: Structured prompts meet simulation data

    H. Zhang, M. Liu, Z. Li, H. Wen, W. Guan, Y. Wang, and L. Nie. Spatial understanding from videos: Structured prompts meet simulation data. arXiv preprint arXiv:2506.03642, 2025

  61. [62]

    Monst3r: A simple approach for estimating geometry in the presence of motion

    J. Zhang, C. Herrmann, J. Hur, V. Jampani, T. Darrell, F. Cole, D. Sun, and M.-H. Yang. Monst3r: A simple approach for estimating geometry in the presence of motion. arXiv preprint arXiv:2410.03825, 2024

  62. [63]

    B. Zhao, S. Yu, W. Ma, M. Yu, S. Mei, A. Wang, J. He, A. Yuille, and A. Kortylewski. Ood-cv: A benchmark for robustness to out-of-distribution shifts of individual nuisances in natural images. In European conference on computer vision, pages 163–180. Springer, 2022

  63. [64]

    B. Zhao, J. Wang, W. Ma, A. Jesslen, S. Yang, S. Yu, O. Zendel, C. Theobalt, A. L. Yuille, and A. Kortylewski. Ood-cv-v2: An extended benchmark for robustness to out-of-distribution shifts of individual nuisances in natural images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(12):11104–11118, 2024

  64. [65]

    Unrealzoo: Enriching photo-realistic virtual worlds for embodied ai

    F. Zhong, K. Wu, C. Wang, H. Chen, H. Ci, Z. Li, and Y. Wang. Unrealzoo: Enriching photo-realistic virtual worlds for embodied ai. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5769–5779, 2025

  65. [66]

    iBOT: Image BERT Pre-Training with Online Tokenizer

    J. Zhou, C. Wei, H. Wang, W. Shen, C. Xie, A. Yuille, and T. Kong. ibot: Image bert pre-training with online tokenizer. arXiv preprint arXiv:2111.07832, 2021

  66. [67]

    Interactive annotation tool and procedural rules in LychSim: Figure 6

  67. [68]

    Example MCP tool schema: Code 1

  68. [69]

    Claude skill for scene planning: Code 2

  69. [70]

    name":"spawn_object

    User input for loft office specification: Code 3. 17 LychSim: A Controllable and Interactive Simulation Framework for Vision Research Figure 6|Interactive annotation tool and procedural rules in LychSim. HaI environment lightingHbI fog simulationHcI rain simulation HeI surface normalHfI instance segmentationHgI point maps HhI object ground truths and occl...

  70. [71]

    **Read the spec.** Parse asset paths, room geometry (floor corners, X/Y/Z ranges), layout requirements, placement options.

  71. [72]

    The room is rarely empty — there are usually persistent scene props you should not delete

    **Snapshot the current state.** In parallel: `list_objects`, `get_camera_location`, `get_camera_rotation`, then `get_camera_lit`. The room is rarely empty — there are usually persistent scene props you should not delete.

  72. [73]

    Functional groupings beat scattered placement

    **Plan zones, not coordinates.** Sketch the layout in zones (desk area, reading nook, plant corners) before computing positions. Functional groupings beat scattered placement.

  73. [74]

    adjust_if_possible

    **Place anchors first.** Spawn the largest anchoring objects (table, soft chair) before stacking smaller items on/around them. Use `collision_handling: "adjust_if_possible"` from the spec.

  74. [75]

    If `get_mesh_extent` works, use it; otherwise estimate

    **Stack using estimated heights.** A standard desk top is at floor Z + ~75cm. If `get_mesh_extent` works, use it; otherwise estimate. Place monitor/books/vase Y between desk Y±40 so they land on the desk, not a neighboring chair.

  75. [76]

    **Place chairs with rotation last.** Don't trust your first guess at chair facing — see Mesh Forward Direction below.

  76. [77]

    Side views for chair orientations

    **Verify from multiple angles.** Top-down (`pitch=-89`, high Z) for layout. Side views for chair orientations. Wide-angle from a corner for the final beauty shot.

  77. [78]

    The user expects you to look at every screenshot critically and self-correct

    **Iterate.** Fix overlaps, wrong-facing chairs, items inside furniture. The user expects you to look at every screenshot critically and self-correct.

  78. [79]

    Final camera location and rotation

    **Restore the final camera pose.** When the scene is done, move the camera to the **"Final camera location and rotation"** values specified in the spec (e.g. `office.md`). This is the canonical hero-shot pose the user expects to see when they next open the scene. Use `set_camera_location` and `set_camera_rotation`, then take one last `get_camera_lit` to confirm....

  79. [80]

    Centimeters, left-handed, Z-up

    **Desktop items.** Place desktop items at the table-top Z, not on the floor. Coordinate system: centimeters, **left-handed, Z-up**. `yaw=0` → forward = **+X**; `yaw=90` → +Y; `yaw=-90` → -Y; `yaw=180` → -X. Floor is typically at Z = -20 in LoftOffice scenes; furniture spawn locations sit at floor Z (objects pivot from their base for most sta...