pith. machine review for the scientific record.

arxiv: 2605.12449 · v1 · submitted 2026-05-12 · 💻 cs.CV

Recognition: 2 theorem links · Lean Theorem

LychSim: A Controllable and Interactive Simulation Framework for Vision Research

Alan Yuille, Chloe Wang, Jiawei Peng, Patrick Li, Siyi Chen, Wufei Ma

Pith reviewed 2026-05-13 06:21 UTC · model grok-4.3

classification 💻 cs.CV
keywords simulation framework · vision research · synthetic data · out-of-distribution evaluation · procedural generation · Python API · LLM integration · closed-loop control

The pith

LychSim provides a Python API and procedural pipeline that makes high-fidelity simulation controllable for vision research and closed-loop testing.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents LychSim as a simulation framework on Unreal Engine 5 that lowers technical barriers for vision researchers who need controllable synthetic data and rigorous out-of-distribution evaluation. It centers on a streamlined Python interface that hides engine details, a procedural generator that creates varied environments with OOD visual challenges together with rich 2D and 3D ground truth, and native integration of the Model Context Protocol to turn the simulator into an interactive environment for reasoning language models. The authors annotate scene rules and object alignments to support automated modifications and demonstrate use cases in synthetic data production, reinforcement-learning adversarial examiners, and language-driven layout changes. A sympathetic reader cares because simulation remains essential for closed-loop optimization and OOD testing even after advances in self-supervised pretraining.

Core claim

LychSim is built around three designs: a streamlined Python API that abstracts underlying engine complexities, a procedural data pipeline that generates diverse high-fidelity environments with varying OOD visual challenges paired with rich 2D and 3D ground truths, and native Model Context Protocol integration that transforms the simulator into a dynamic closed-loop playground for reasoning agentic LLMs. Scene-level procedural rules and object-level pose alignments are annotated to enable semantically aligned 3D ground truths and automated scene modification.
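The page quotes MCP tool names such as `spawn_object`, `list_objects`, `set_camera_location`, `set_camera_rotation`, and `get_camera_lit` in its appendix excerpts but does not reproduce the Python API itself. As a rough illustration of what an engine-hiding interface of this kind could look like, the sketch below stubs out a hypothetical `SimClient`; the class, its methods, and the asset path are invented for illustration and should not be read as the released LychSim API.

```python
# Illustrative sketch only: SimClient and its methods are assumptions, not the released LychSim API.
from dataclasses import dataclass


@dataclass
class Pose:
    # Centimeters, Z-up, yaw=0 facing +X (the convention quoted in the appendix excerpts).
    x: float
    y: float
    z: float
    yaw: float = 0.0


class SimClient:
    """Stand-in for a streamlined engine-control client that hides Unreal Engine details."""

    def __init__(self) -> None:
        self._objects: dict[str, Pose] = {}  # a real client would hold an engine connection instead

    def spawn_object(self, asset: str, pose: Pose,
                     collision_handling: str = "adjust_if_possible") -> str:
        """Place `asset` at `pose` and return an object id (the engine call is omitted here)."""
        obj_id = f"{asset.rsplit('/', 1)[-1]}_{len(self._objects)}"
        self._objects[obj_id] = pose
        return obj_id

    def set_camera(self, location: Pose, pitch: float = 0.0) -> None:
        """Move the viewpoint; stands in for the quoted set_camera_location / set_camera_rotation."""
        self._camera = (location, pitch)

    def render(self, path: str) -> None:
        """Save a lit RGB frame; stands in for the quoted get_camera_lit."""
        print(f"would write render to {path}")

    def ground_truth(self, modalities: list[str]) -> dict[str, str]:
        """Request per-frame labels such as depth or instance segmentation (names illustrative)."""
        return {m: f"{m}.npz" for m in modalities}


# Closed-loop-style usage: spawn, frame, render, fetch labels.
sim = SimClient()
desk = sim.spawn_object("/Assets/Furniture/Desk", Pose(0, 0, -20))
sim.set_camera(Pose(300, 0, 150, yaw=180), pitch=-15)
sim.render("renders/office.png")
labels = sim.ground_truth(["depth", "instance_segmentation"])
```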

What carries the argument

LychSim's three core designs: the streamlined Python API for control, the procedural data pipeline for environment and ground-truth generation, and the native MCP integration for interactive LLM-driven scene control.

Load-bearing premise

That the Python API and procedural pipeline are fully implemented and produce sufficiently diverse, high-fidelity OOD scenes, and that the promised public release includes complete, usable code with accurate annotations.

What would settle it

Public release of the code where users without game-development expertise successfully generate varied OOD datasets, obtain matching 2D/3D ground truth, and run closed-loop experiments controlled by reasoning language models.

read the original abstract

While self-supervised pretraining has reduced vision systems' reliance on synthetic data, simulation remains an indispensable tool for closed-loop optimization and rigorous out-of-distribution (OOD) evaluation. However, modern simulation platforms often present steep technical barriers, requiring extensive expertise in computer graphics and game development. In this work, we present LychSim, a highly controllable and interactive simulation framework built upon Unreal Engine 5 to bridge this gap. LychSim is built around three key designs: (1) a streamlined Python API that abstracts away underlying engine complexities; (2) a procedural data pipeline capable of generating diverse, high-fidelity environments with varying out-of-distribution (OOD) visual challenges, paired with rich 2D and 3D ground truths; and (3) a native integration of the Model Context Protocol (MCP) that transforms the simulator into a dynamic, closed-loop playground for reasoning agentic LLMs. We further annotate scene-level procedural rules and object-level pose alignments to enable semantically aligned 3D ground truths and automated scene modification. We demonstrate LychSim's capability across multiple downstream applications, including serving as a synthetic data engine, powering reinforcement learning-based adversarial examiners, and facilitating interactive, language-driven scene layout generation. To benefit the broader vision community, LychSim will be made publicly available, including full source code and various data annotations.
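For readers unfamiliar with MCP, the sketch below shows one way a single simulator action could be exposed as an MCP tool, assuming the `FastMCP` helper from the official MCP Python SDK. The tool name `spawn_object` echoes the schema quoted in the appendix excerpts below; its signature and placeholder body are illustrative, not the paper's implementation.

```python
# Sketch under assumptions: FastMCP comes from the official MCP Python SDK;
# the tool body is a placeholder, not the LychSim engine bridge.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("lychsim")  # server name shown to a connected LLM client


@mcp.tool()
def spawn_object(asset_path: str, x: float, y: float, z: float,
                 yaw: float = 0.0, collision_handling: str = "adjust_if_possible") -> str:
    """Spawn an asset at (x, y, z) centimeters with the given yaw and report the result."""
    # A real server would forward this request to the simulator's Python API.
    return f"spawned {asset_path} at ({x}, {y}, {z}), yaw={yaw}, collision={collision_handling}"


if __name__ == "__main__":
    mcp.run()  # serve over stdio so a reasoning LLM can call the tool in a closed loop
```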

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces LychSim, a controllable simulation framework built on Unreal Engine 5 for vision research. It centers on three designs: a streamlined Python API abstracting engine complexities, a procedural pipeline generating diverse high-fidelity OOD scenes with rich 2D/3D ground truths and scene-level annotations, and native MCP integration enabling closed-loop interaction with reasoning LLM agents. The work describes applications as a synthetic data engine, RL-based adversarial examiners, and language-driven scene layout, with plans for public release of code and annotations.

Significance. If the implementation matches the description and the code is released in usable form, LychSim could meaningfully reduce technical barriers for vision researchers seeking high-fidelity simulation for OOD evaluation and agentic closed-loop tasks, complementing existing platforms by emphasizing accessibility and LLM integration.

major comments (3)
  1. [Abstract] Abstract and demonstrations section: the central claims that the Python API, procedural pipeline, and MCP integration are fully functional and produce diverse high-fidelity OOD challenges rest entirely on descriptive assertions; no quantitative metrics (e.g., scene diversity scores, rendering latency, ground-truth alignment error, or comparison against baselines such as Habitat or AI2-THOR) are provided to support these assertions.
  2. [Demonstrations] Demonstrations section: the reported uses in synthetic data generation, RL adversarial examiners, and language-driven layout are presented without example outputs, success rates, ablation studies, or any empirical validation, leaving the bridging-the-gap claim unsupported by evidence.
  3. [Future release] Future release paragraph: the utility narrative depends on the promised public release of complete source code, annotations, and MCP integration; no current repository link, code snippets, or implementation details are supplied to allow verification of the described features.
minor comments (2)
  1. [Abstract] Abstract: the phrase 'various data annotations' is vague; specify the exact annotation types (e.g., semantic segmentation, depth, pose) and their formats.
  2. [Related work] Consider adding a table comparing LychSim's feature set (API, MCP support, procedural rules) against at least two existing simulators to clarify its positioning.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive review and for identifying areas where additional empirical support would strengthen the manuscript. We address each major comment below and describe the revisions we will make.

read point-by-point responses
  1. Referee: [Abstract] Abstract and demonstrations section: the central claims that the Python API, procedural pipeline, and MCP integration are fully functional and produce diverse high-fidelity OOD challenges rest entirely on descriptive assertions; no quantitative metrics (e.g., scene diversity scores, rendering latency, ground-truth alignment error, or comparison against baselines such as Habitat or AI2-THOR) are provided to support these assertions.

    Authors: We agree that quantitative evidence would better substantiate the claims. In the revised manuscript we will add a dedicated evaluation subsection reporting scene diversity (via procedural parameter entropy and visual variation metrics), average rendering latency for scenes of varying complexity, and ground-truth alignment accuracy (measured against manually verified annotations). We will also include a feature-comparison table against Habitat and AI2-THOR emphasizing API accessibility and OOD generation. A minimal sketch of one such diversity metric appears after this list. revision: yes

  2. Referee: [Demonstrations] Demonstrations section: the reported uses in synthetic data generation, RL adversarial examiners, and language-driven layout are presented without example outputs, success rates, ablation studies, or any empirical validation, leaving the bridging-the-gap claim unsupported by evidence.

    Authors: The current demonstrations are primarily illustrative. We will expand this section with concrete example outputs (sample generated scenes and annotations), quantitative success rates for the RL adversarial examiners (e.g., attack success on standard vision models), and preliminary results for language-driven layout tasks. While exhaustive ablations exceed the scope of a framework paper, we will include initial empirical numbers to support the utility claims. revision: partial

  3. Referee: [Future release] Future release paragraph: the utility narrative depends on the promised public release of complete source code, annotations, and MCP integration; no current repository link, code snippets, or implementation details are supplied to allow verification of the described features.

    Authors: We acknowledge the need for immediate verifiability. The revised manuscript will include code snippets demonstrating the Python API and MCP integration in an appendix. We commit to releasing the full source code, annotations, and MCP integration on a public repository upon acceptance and will provide reviewers with a private link during the revision cycle if the venue permits. revision: partial
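To make the diversity metric mentioned in response 1 concrete, the sketch below shows one way a procedural-parameter entropy score could be computed; it is not taken from the paper, and the parameter semantics are hypothetical.

```python
# Minimal sketch (an assumption, not the authors' metric): mean Shannon entropy of
# discretized procedural parameters as a crude scene-diversity score.
import numpy as np


def parameter_entropy(samples: np.ndarray, bins: int = 16) -> float:
    """Mean per-dimension entropy (bits) of sampled procedural parameters.

    samples: shape (num_scenes, num_parameters), e.g. lighting intensity, fog density,
    object count drawn by the generator (parameter names hypothetical).
    """
    entropies = []
    for column in samples.T:
        counts, _ = np.histogram(column, bins=bins)
        p = counts / counts.sum()
        p = p[p > 0]
        entropies.append(float(-(p * np.log2(p)).sum()))
    return float(np.mean(entropies))


# Toy usage: 1000 scenes with 3 uniformly sampled parameters approach log2(16) = 4 bits.
rng = np.random.default_rng(0)
print(parameter_entropy(rng.uniform(size=(1000, 3))))
```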

Circularity Check

0 steps flagged

No circularity in derivation chain

full rationale

The paper is a software framework description with no mathematical derivations, equations, predictions, fitted parameters, or first-principles results. Its claims concern the existence and utility of a Python API, procedural pipeline, and MCP integration, none of which are presented as derived from prior quantities by construction. No load-bearing steps reduce to self-definition, fitted inputs renamed as predictions, or self-citation chains; the work is self-contained as an engineering artifact whose validity depends on implementation completeness rather than any circular logical reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a software engineering contribution with no free parameters, axioms, or invented scientific entities.

pith-pipeline@v0.9.0 · 5553 in / 1022 out tokens · 45213 ms · 2026-05-13T06:21:53.733249+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Reference graph

Works this paper leans on

79 extracted references · 79 canonical work pages · 4 internal anchors

  1. [1]

    Claude Opus 4.6, 2026

    Anthropic. Claude Opus 4.6, 2026. URL https://www.anthropic.com/news/claude-opus-4-6. Accessed: 2026-04-11

  2. [2]

    Sims-v: Simulated instruction-tuning for spatial video understanding. arXiv preprint arXiv:2511.04668, 2025

    E. Brown, A. Ray, R. Krishna, R. Girshick, R. Fergus, and S. Xie. Sims-v: Simulated instruction-tuning for spatial video understanding. arXiv preprint arXiv:2511.04668, 2025

  3. [3]

    Emerging properties in self-supervised vision transformers

    M. Caron, H. Touvron, I. Misra, H. Jégou, J. Mairal, P. Bojanowski, and A. Joulin. Emerging properties in self-supervised vision transformers. In Proceedings of the IEEE/CVF international conference on computer vision, pages 9650–9660, 2021

  4. [4]

    I-design: Personalized llm interior designer

    A. Çelen, G. Han, K. Schindler, L. Van Gool, I. Armeni, A. Obukhov, and X. Wang. I-design: Personalized llm interior designer. In European Conference on Computer Vision, pages 217–234. Springer, 2024

  5. [5]

    B. Chen, Z. Xu, S. Kirmani, B. Ichter, D. Sadigh, L. Guibas, and F. Xia. Spatialvlm: Endowing vision-language models with spatial reasoning capabilities. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14455–14465, 2024

  6. [6]

    Spatialrgpt: Grounded spatial reasoning in vision-language models

    A.-C. Cheng, H. Yin, Y. Fu, Q. Guo, R. Yang, J. Kautz, X. Wang, and S. Liu. Spatialrgpt: Grounded spatial reasoning in vision-language models. Advances in Neural Information Processing Systems, 37:135062–135093, 2024

  7. [7]

    Pybullet, a python module for physics simulation for games, robotics and machine learning

    E. Coumans and Y. Bai. Pybullet, a python module for physics simulation for games, robotics and machine learning. http://pybullet.org, 2016–2021

  8. [8]

    ProcTHOR: Large-Scale Embodied AI Using Procedural Generation

    M. Deitke, E. VanderBilt, A. Herrasti, L. Weihs, J. Salvador, K. Ehsani, W. Han, E. Kolve, A. Farhadi, A. Kembhavi, and R. Mottaghi. ProcTHOR: Large-Scale Embodied AI Using Procedural Generation. In NeurIPS, 2022. Outstanding Paper Award

  9. [9]

    J. Deng, W. Chai, J. Huang, Z. Zhao, Q. Huang, M. Gao, J. Guo, S. Hao, W. Hu, J.-N. Hwang, et al. Citycraft: A real crafter for 3d city generation. arXiv preprint arXiv:2406.04983, 2024

  10. [10]

    Carla: An open urban driving simulator

    A. Dosovitskiy, G. Ros, F. Codevilla, A. Lopez, and V. Koltun. Carla: An open urban driving simulator. In Conference on robot learning, pages 1–16. PMLR, 2017

  11. [11]

    Fab asset marketplace, 2026

    Epic Games. Fab asset marketplace, 2026. URL https://www.fab.com/. Unified marketplace for digital assets

  12. [12]

    W. Feng, W. Zhu, T.-j. Fu, V. Jampani, A. Akula, X. He, S. Basu, X. E. Wang, and W. Y. Wang. Layoutgpt: Compositional visual planning and generation with large language models. Advances in Neural Information Processing Systems, 36:18225–18250, 2023

  13. [13]

    Virtual worlds as proxy for multi-object tracking analysis

    A. Gaidon, Q. Wang, Y. Cabon, and E. Vig. Virtual worlds as proxy for multi-object tracking analysis. In CVPR, 2016

  14. [14]

    Cater: A diagnostic dataset for compositional actions and temporal reasoning

    R. Girdhar and D. Ramanan. Cater: A diagnostic dataset for compositional actions and temporal reasoning. arXiv preprint arXiv:1910.04744, 2019

  15. [15]

    Gemma-4-31b, 2026

    Google DeepMind. Gemma-4-31b, 2026. URL https://huggingface.co/google/gemma-4-31B. Hugging Face Model Card

  16. [16]

    K. He, X. Chen, S. Xie, Y. Li, P. Dollár, and R. Girshick. Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 16000–16009, 2022

  17. [17]

    Z. Hu, A. Iscen, A. Jain, T. Kipf, Y. Yue, D. A. Ross, C. Schmid, and A. Fathi. Scenecraft: An llm agent for synthesizing 3d scenes as blender code. In Forty-first International Conference on Machine Learning, 2024

  18. [18]

    Clevr: A diagnostic dataset for compositional language and elementary visual reasoning

    J. Johnson, B. Hariharan, L. Van Der Maaten, L. Fei-Fei, C. Lawrence Zitnick, and R. Girshick. Clevr: A diagnostic dataset for compositional language and elementary visual reasoning. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2901–2910, 2017

  19. [19]

    Habitat synthetic scenes dataset (hssd-200): An analysis of 3d scene scale and realism tradeoffs for objectgoal navigation

    M. Khanna, Y. Mao, H. Jiang, S. Haresh, B. Shacklett, D. Batra, A. Clegg, E. Undersander, A. X. Chang, and M. Savva. Habitat synthetic scenes dataset (hssd-200): An analysis of 3d scene scale and realism tradeoffs for objectgoal navigation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16384–16393, 2024

  20. [20]

    T. S. Kim, M. Peven, W. Qiu, A. Yuille, and G. D. Hager. Synthesizing attributes with unreal engine for fine-grained activity analysis. In 2019 IEEE Winter Applications of Computer Vision Workshops (WACVW), pages 35–37. IEEE, 2019

  21. [21]

    Segment anything

    A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W.-Y. Lo, et al. Segment anything. In Proceedings of the IEEE/CVF international conference on computer vision, pages 4015–4026, 2023

  22. [22]

    AI2-THOR: An Interactive 3D Environment for Visual AI

    E. Kolve, R. Mottaghi, W. Han, E. VanderBilt, L. Weihs, A. Herrasti, D. Gordon, Y. Zhu, A. Gupta, and A. Farhadi. AI2-THOR: An Interactive 3D Environment for Visual AI. arXiv, 2017

  23. [23]

    Re-thinking inverse graphics with large language models

    P. Kulits, H. Feng, W. Liu, V. F. Abrevaya, and M. J. Black. Re-thinking inverse graphics with large language models. Transactions on Machine Learning Research, 2024. ISSN 2835-8856. URL https://openreview.net/forum?id=u0eiu1MTS7

  24. [24]

    J. Lee, X. Wang, J. Peng, L. Ye, Z. Zheng, T. Zhang, T. Wang, W. Ma, S. Chen, Y.-C. Chou, et al. Perceptual taxonomy: Evaluating and guiding hierarchical scene reasoning in vision-language models. arXiv preprint arXiv:2511.19526, 2025

  25. [25]

    Grounding image matching in 3d with mast3r

    V. Leroy, Y. Cabon, and J. Revaud. Grounding image matching in 3d with mast3r. In European conference on computer vision, pages 71–91. Springer, 2024

  26. [26]

    Z. Li, X. Wang, E. Stengel-Eskin, A. Kortylewski, W. Ma, B. Van Durme, and A. L. Yuille. Super-clevr: A virtual benchmark to diagnose domain robustness in visual reasoning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 14963–14973, 2023

  27. [27]

    Q. Liu, A. Kortylewski, and A. L. Yuille. Poseexaminer: Automated testing of out-of-distribution robustness in human pose and shape estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 672–681, 2023

  28. [28]

    W. Ma, Q. Liu, J. Wang, A. Wang, X. Yuan, Y. Zhang, Z. Xiao, G. Zhang, B. Lu, R. Duan, Y. Qi, A. Kortylewski, Y. Liu, and A. Yuille. Generating images with 3d annotations using diffusion models. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=XlkN11Xj6J

  29. [29]

    W. Ma, G. Zhang, Q. Liu, G. Zeng, A. Kortylewski, Y. Liu, and A. Yuille. Imagenet3d: Towards general-purpose object-level 3d understanding. Advances in Neural Information Processing Systems, 37:96127–96149, 2024

  30. [30]

    W. Ma, H. Chen, G. Zhang, Y.-C. Chou, J. Chen, C. de Melo, and A. Yuille. 3dsrbench: A comprehensive 3d spatial reasoning benchmark. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6924–6934, 2025

  31. [31]

    W. Ma, L. Ye, C. M. de Melo, A. Yuille, and J. Chen. Spatialllm: A compound 3d-informed design towards spatially-intelligent large multimodal models. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 17249–17260, 2025

  32. [32]

    W. Ma, S. Cen, J. Shen, R. Lee, L. Begiristain, Y. Zhuang, J. Peng, Z. Yu, T. Song, X. Qi, T. Shu, A. Kortylewski, and A. Yuille. Unrealspace: Analyzing spatial understanding and reasoning in controllable simulation. In Findings of the Computer Vision and Pattern Recognition Conference, 2026

  33. [33]

    Spatialreasoner: Towards explicit and generalizable 3d spatial reasoning

    W. Ma, Y.-C. Chou, Q. Liu, X. Wang, C. de Melo, J. Xie, and A. Yuille. Spatialreasoner: Towards explicit and generalizable 3d spatial reasoning. Advances in Neural Information Processing Systems, 38:140751–140774, 2026

  34. [34]

    C. Ning, J. Peng, J. Wang, Y. Sun, Y. Liu, A. Yuille, A. Kortylewski, and A. Wang. Part321: Recognizing 3d object parts from a 2d image using 1-shot annotations, 2024. URL https://openreview.net/forum?id=jdFoxDnBwY

  35. [35]

    DINOv2: Learning Robust Visual Features without Supervision

    M. Oquab, T. Darcet, T. Moutakanni, H. Vo, M. Szafraniec, V. Khalidov, P. Fernandez, D. Haziza, F. Massa, A. El-Nouby, et al. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023

  36. [36]

    X. Puig, E. Undersander, A. Szot, M. D. Cote, R. Partsey, J. Yang, R. Desai, A. W. Clegg, M. Hlavac, T. Min, T. Gervet, V. Vondrus, V.-P. Berges, J. Turner, O. Maksymets, Z. Kira, M. Kalakrishnan, J. Malik, D. S. Chaplot, U. Jain, D. Batra, A. Rai, and R. Mottaghi. Habitat 3.0: A co-habitat for humans, avatars and robots, 2023

  37. [38]

    W. Qiu, F. Zhong, Y. Zhang, S. Qiao, Z. Xiao, T. S. Kim, and Y. Wang. Unrealcv: Virtual worlds for computer vision. In Proceedings of the 25th ACM international conference on Multimedia, pages 1221–1224, 2017

  38. [39]

    Learning transferable visual models from natural language supervision

    A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021

  39. [40]

    Infinite photorealistic worlds using procedural generation

    A. Raistrick, L. Lipson, Z. Ma, L. Mei, M. Wang, Y. Zuo, K. Kayan, H. Wen, B. Han, Y. Wang, A. Newell, H. Law, A. Goyal, K. Yang, and J. Deng. Infinite photorealistic worlds using procedural generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12630–12641, 2023

  40. [41]

    Infinigen indoors: Photorealistic indoor scenes using procedural generation

    A. Raistrick, L. Mei, K. Kayan, D. Yan, Y. Zuo, B. Han, H. Wen, M. Parakh, S. Alexandropoulos, L. Lipson, Z. Ma, and J. Deng. Infinigen indoors: Photorealistic indoor scenes using procedural generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 21783–21794, June 2024

  41. [42]

    A. Ray, J. Duan, E. Brown, R. Tan, D. Bashkirova, R. Hendrix, K. Ehsani, A. Kembhavi, B. A. Plummer, R. Krishna, et al. Sat: Dynamic spatial aptitude training for multimodal language models. arXiv preprint arXiv:2412.07755, 2024

  42. [43]

    Hypersim: A photorealistic synthetic dataset for holistic indoor scene understanding

    M. Roberts, J. Ramapuram, A. Ranjan, A. Kumar, M. A. Bautista, N. Paczan, R. Webb, and J. M. Susskind. Hypersim: A photorealistic synthetic dataset for holistic indoor scene understanding. In Proceedings of the IEEE/CVF international conference on computer vision, pages 10912–10922, 2021

  43. [44]

    G. Ros, L. Sellart, J. Materzynska, D. Vazquez, and A. M. Lopez. The synthia dataset: A large collection of synthetic images for semantic segmentation of urban scenes. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3234–3243, 2016

  44. [45]

    N. Ruiz, A. Kortylewski, W. Qiu, C. Xie, S. A. Bargal, A. Yuille, and S. Sclaroff. Simulated adversarial testing of face recognition models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 4145–4155, June 2022

  45. [46]

    M. Shu, C. Liu, W. Qiu, and A. Yuille. Identifying model weakness with adversarial examiner. In Proceedings of the AAAI conference on artificial intelligence, volume 34, pages 11998–12006, 2020

  46. [47]

    DINOv3

    O. Siméoni, H. V. Vo, M. Seitzer, F. Baldassarre, M. Oquab, C. Jose, V. Khalidov, M. Szafraniec, S. Yi, M. Ramamonjisoa, et al. Dinov3. arXiv preprint arXiv:2508.10104, 2025

  47. [48]

    H. Slim, X. Li, Y. Li, M. Ahmed, M. Ayman, U. Upadhyay, A. Abdelreheem, A. Prajapati, S. Pothigara, P. Wonka, et al. 3dcompat++: An improved large-scale 3d vision dataset for compositional recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025

  48. [49]

    F.-Y. Sun, W. Liu, S. Gu, D. Lim, G. Bhat, F. Tombari, M. Li, N. Haber, and J. Wu. Layoutvlm: Differentiable optimization of 3d layout via vision-language models. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 29469–29478, 2025

  49. [50]

    Mujoco: A physics engine for model-based control

    E. Todorov, T. Erez, and Y. Tassa. Mujoco: A physics engine for model-based control. In 2012 IEEE/RSJ international conference on intelligent robots and systems, pages 5026–5033. IEEE, 2012

  50. [51]

    F. Tosi, Y. Liao, C. Schmitt, and A. Geiger. Smd-nets: Stereo mixture density networks. In Conference on Computer Vision and Pattern Recognition (CVPR), 2021

  51. [52]

    H. Wang, Q. Xue, and W. Gao. Infinibench: Infinite benchmarking for visual spatial reasoning with customizable scene complexity. arXiv preprint arXiv:2511.18200, 2025

  52. [53]

    J. Wang, M. Chen, N. Karaev, A. Vedaldi, C. Rupprecht, and D. Novotny. Vggt: Visual geometry grounded transformer. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 5294–5306, 2025

  53. [54]

    S. Wang, V. Leroy, Y. Cabon, B. Chidlovskii, and J. Revaud. Dust3r: Geometric 3d vision made easy. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 20697–20709, 2024

  54. [55]

    X. Wang, W. Ma, Z. Li, A. Kortylewski, and A. L. Yuille. 3d-aware visual question answering about parts, poses and occlusions. Advances in Neural Information Processing Systems, 36:58717–58735, 2023

  55. [56]

    X. Wang, W. Ma, T. Zhang, C. M. de Melo, J. Chen, and A. Yuille. Spatial457: A diagnostic benchmark for 6d spatial reasoning of large multimodal models. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 24669–24679, 2025

  56. [57]

    Holodeck: Language guided generation of 3d embodied ai environments

    Y. Yang, F.-Y. Sun, L. Weihs, E. VanderBilt, A. Herrasti, W. Han, J. Wu, N. Haber, R. Krishna, L. Liu, et al. Holodeck: Language guided generation of 3d embodied ai environments. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16227–16237, 2024

  57. [58]

    X. Ye, J. Ren, Y. Zhuang, X. He, Y. Liang, Y. Yang, M. Dogra, X. Zhong, E. Liu, K. Benavente, et al. Simworld: An open-ended simulator for agents in physical and social worlds. In The Thirty-ninth Annual Conference on Neural Information Processing Systems

  58. [59]

    K. Yi, C. Gan, Y. Li, P. Kohli, J. Wu, A. Torralba, and J. B. Tenenbaum. Clevrer: Collision events for video representation and reasoning. arXiv preprint arXiv:1910.01442, 2019

  59. [60]

    S. Yin, J. Ge, Z. Z. Wang, X. Li, M. J. Black, T. Darrell, A. Kanazawa, and H. Feng. Vision-as-inverse-graphics agent via interleaved multimodal reasoning, 2026. URL https://arxiv.org/abs/2601.11109

  60. [61]

    Spatial understanding from videos: Structured prompts meet simulation data

    H. Zhang, M. Liu, Z. Li, H. Wen, W. Guan, Y. Wang, and L. Nie. Spatial understanding from videos: Structured prompts meet simulation data. arXiv preprint arXiv:2506.03642, 2025

  61. [62]

    Monst3r: A simple approach for estimating geometry in the presence of motion

    J. Zhang, C. Herrmann, J. Hur, V. Jampani, T. Darrell, F. Cole, D. Sun, and M.-H. Yang. Monst3r: A simple approach for estimating geometry in the presence of motion. arXiv preprint arXiv:2410.03825, 2024

  62. [63]

    B. Zhao, S. Yu, W. Ma, M. Yu, S. Mei, A. Wang, J. He, A. Yuille, and A. Kortylewski. Ood-cv: A benchmark for robustness to out-of-distribution shifts of individual nuisances in natural images. In European conference on computer vision, pages 163–180. Springer, 2022

  63. [64]

    B. Zhao, J. Wang, W. Ma, A. Jesslen, S. Yang, S. Yu, O. Zendel, C. Theobalt, A. L. Yuille, and A. Kortylewski. Ood-cv-v2: An extended benchmark for robustness to out-of-distribution shifts of individual nuisances in natural images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(12):11104–11118, 2024

  64. [65]

    Unrealzoo: Enriching photo-realistic virtual worlds for embodied ai

    F. Zhong, K. Wu, C. Wang, H. Chen, H. Ci, Z. Li, and Y. Wang. Unrealzoo: Enriching photo-realistic virtual worlds for embodied ai. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5769–5779, 2025

  65. [66]

    iBOT: Image BERT Pre-Training with Online Tokenizer

    J. Zhou, C. Wei, H. Wang, W. Shen, C. Xie, A. Yuille, and T. Kong. ibot: Image bert pre-training with online tokenizer. arXiv preprint arXiv:2111.07832, 2021

  66. [67]

    Interactive annotation tool and procedural rules in LychSim: Figure 6

  67. [68]

    Example MCP tool schema: Code 1

  68. [69]

    Claude skill for scene planning: Code 2

  69. [70]

    name":"spawn_object

    User input for loft office specification: Code 3. 17 LychSim: A Controllable and Interactive Simulation Framework for Vision Research Figure 6|Interactive annotation tool and procedural rules in LychSim. HaI environment lightingHbI fog simulationHcI rain simulation HeI surface normalHfI instance segmentationHgI point maps HhI object ground truths and occl...

  70. [71]

    **Read the spec.** Parse asset paths, room geometry (floor corners, X/Y/Z ranges), layout requirements, placement options.

  71. [72]

    The room is rarely empty — there are usually persistent scene props you should not delete

    **Snapshot the current state.** In parallel: `list_objects`, `get_camera_location`, `get_camera_rotation`, then `get_camera_lit`. The room is rarely empty — there are usually persistent scene props you should not delete.

  72. [73]

    Functional groupings beat scattered placement

    **Plan zones, not coordinates.** Sketch the layout in zones (desk area, reading nook, plant corners) before computing positions. Functional groupings beat scattered placement.

  73. [74]

    adjust_if_possible

    **Place anchors first.** Spawn the largest anchoring objects (table, soft chair) before stacking smaller items on/around them. Use `collision_handling: "adjust_if_possible"` from the spec.

  74. [75]

    If `get_mesh_extent` works, use it; otherwise estimate

    **Stack using estimated heights.** A standard desk top is at floor Z + ~75cm. If `get_mesh_extent` works, use it; otherwise estimate. Place monitor/books/vase Y between desk Y±40 so they land on the desk, not a neighboring chair.

  75. [76]

    **Place chairs with rotation last.** Don't trust your first guess at chair facing — see Mesh Forward Direction below.

  76. [77]

    Side views for chair orientations

    **Verify from multiple angles.** Top-down (`pitch=-89`, high Z) for layout. Side views for chair orientations. Wide-angle from a corner for the final beauty shot.

  77. [78]

    The user expects you to look at every screenshot critically and self-correct

    **Iterate.** Fix overlaps, wrong-facing chairs, items inside furniture. The user expects you to look at every screenshot critically and self-correct.

  78. [79]

    Final camera location and rotation

    **Restore the final camera pose.** When the scene is done, move the camera to the **"Final camera location and rotation"** values specified in the spec (e.g. `office.md`). This is the canonical hero-shot pose the user expects to see when they next open the scene. Use `set_camera_location` and `set_camera_rotation`, then take one last `get_camera_lit` to confirm....

  79. [80]

    Centimeters, left-handed, Z-up

    **Desktop items.** Place desktop items at the table-top Z, not on the floor. Coordinate system: centimeters, **left-handed, Z-up**. `yaw=0` → forward = **+X**; `yaw=90` → +Y; `yaw=-90` → -Y; `yaw=180` → -X. Floor is typically at Z = -20 in LoftOffice scenes; furniture spawn locations sit at floor Z (objects pivot from their base for most sta...