
arXiv: 2601.11109 · v3 · submitted 2026-01-16 · 💻 cs.CV · cs.AI · cs.GR

Vision-as-Inverse-Graphics Agent via Interleaved Multimodal Reasoning

Pith reviewed 2026-05-16 13:46 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.GR
keywords vision-as-inverse-graphics · multimodal reasoning · code-render-inspect loop · iterative editing · BlenderBench · VLM agent · 3D reconstruction · inverse graphics

The pith

VIGA agent reconstructs images into editable programs using interleaved code and visual reasoning

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes VIGA to overcome the limitations of vision-language models that lack fine-grained spatial grounding in one-shot image-to-program tasks. It introduces an interleaved multimodal reasoning framework built around a code-render-inspect loop that synthesizes symbolic programs, projects them into visual states, and inspects discrepancies to guide refinements. Equipped with semantic skills and evolving multimodal memory, the agent sustains evidence-based edits over long sequences. A sympathetic reader would care because this training-free method enables accurate 2D document generation, 3D reconstruction, multi-step editing, and 4D interaction that current one-shot approaches cannot sustain.

Core claim

VIGA operates through a tightly coupled code-render-inspect loop where symbolic logic and visual perception actively cross-verify each other, allowing the synthesis of programs that are rendered to images, inspected for discrepancies, and iteratively edited using an evolving multimodal memory to sustain long-horizon modifications without task-specific training.

What carries the argument

The code-render-inspect loop, in which symbolic programs are generated, rendered into visual states, and discrepancies are inspected to drive iterative edits while maintaining consistency via multimodal memory.
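The loop described above can be sketched in a few lines. This is a toy illustration under stated assumptions, not the paper's implementation: the "program" is a hypothetical dictionary of scene parameters, `render` simply echoes those parameters instead of running a real renderer, `inspect` returns the single largest discrepancy, and the edit step applies an exact correction where VIGA would ask a VLM to propose one. The bounded `deque` is a stand-in for the evolving multimodal memory.

```python
from collections import deque
from typing import Deque, Dict, List, Tuple

Program = Dict[str, float]  # hypothetical stand-in for a symbolic scene program


def render(program: Program) -> Dict[str, float]:
    """Toy renderer: the 'image' is just a copy of the program's parameters."""
    return dict(program)


def inspect(rendered: Dict[str, float], target: Dict[str, float]) -> Tuple[str, float]:
    """Return the parameter with the largest absolute discrepancy and its signed error."""
    key = max(target, key=lambda k: abs(target[k] - rendered.get(k, 0.0)))
    return key, target[key] - rendered.get(key, 0.0)


def code_render_inspect(
    target: Dict[str, float],
    initial: Program,
    max_steps: int = 20,
    tol: float = 1e-3,
    window: int = 4,
) -> Tuple[Program, List[Tuple[int, str, float]]]:
    """Iterate: render the program, inspect the result, edit the dominant discrepancy."""
    program = dict(initial)
    # Sliding-window memory: only the most recent `window` edits are retained.
    memory: Deque[Tuple[int, str, float]] = deque(maxlen=window)
    for step in range(max_steps):
        image = render(program)
        key, error = inspect(image, target)
        if abs(error) < tol:
            break  # rendered state matches the target closely enough
        # Assumption: an exact correction; a real agent would propose this edit.
        program[key] = program.get(key, 0.0) + error
        memory.append((step, key, error))
    return program, list(memory)
```

Calling `code_render_inspect({"x": 1.0, "y": -2.0, "scale": 0.5}, {"x": 0.0, "y": 0.0, "scale": 1.0})` corrects one parameter per iteration, largest error first, and stops once no discrepancy exceeds `tol`. The point of the sketch is the control flow: each edit is justified by an inspected discrepancy rather than by a single one-shot guess.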

If this is right

  • Delivers relative accuracy improvements over one-shot baselines of 35.32 percent on BlenderGym, 117.17 percent on SlideBench, and 124.70 percent on the new BlenderBench.
  • Seamlessly handles 2D document generation, 3D reconstruction, multi-step 3D editing, and 4D physical interaction in one task-agnostic framework.
  • Remains training-free, relying on the loop and memory rather than fine-tuning for broad applicability.
  • The cross-verification between symbolic code and rendered visuals reduces reliance on perfect one-shot spatial understanding.
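To make the headline percentages concrete, here is a minimal sketch of how a relative improvement is computed, assuming the figures are relative gains over the one-shot baseline (as the referee summary on this page reads them). The function name and the example scores are hypothetical; the paper's underlying baseline scores are not given here.

```python
def relative_gain(baseline: float, improved: float) -> float:
    """Relative improvement of `improved` over `baseline`, in percent."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (improved - baseline) / baseline * 100.0


# Hypothetical scores, chosen only to illustrate the arithmetic.
example = relative_gain(baseline=20.0, improved=44.94)
```

Under this reading, a gain above 100 percent means the score more than doubled: a 124.70 percent relative gain corresponds to a score about 2.25 times the baseline, which says nothing by itself about how high the absolute score is.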

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same loop structure could be tested on inferring editable parameters from real-world video rather than synthetic renders.
  • If memory prevents drift, similar interleaved verification might apply to other agent tasks that require precise spatial output, such as robot motion planning.
  • The gains on BlenderBench suggest the method scales to harder visual-to-code problems where single-pass generation fails.

Load-bearing premise

The vision-language model can sustain evidence-based iterative edits over long horizons without error accumulation or drift in the code-render-inspect loop.

What would settle it

Observing clear performance degradation or increasing visual discrepancies after many iterations on long-horizon tasks within BlenderBench would show that sustained accuracy does not hold.
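One simple way to operationalize this test is to log the visual discrepancy after each iteration and check its trend. The sketch below fits a least-squares slope to such a log; the function name and the idea of summarizing each iteration as a single scalar discrepancy are assumptions for illustration, not a protocol taken from the paper.

```python
from typing import Sequence


def discrepancy_slope(errors: Sequence[float]) -> float:
    """Least-squares slope of discrepancy versus iteration index.

    A positive slope means discrepancies grow as the loop runs, which is
    the degradation pattern that would count against sustained accuracy.
    """
    n = len(errors)
    if n < 2:
        raise ValueError("need at least two iterations")
    mean_x = (n - 1) / 2
    mean_y = sum(errors) / n
    cov = sum((i - mean_x) * (e - mean_y) for i, e in enumerate(errors))
    var = sum((i - mean_x) ** 2 for i in range(n))
    return cov / var
```

A single slope is a coarse summary: a run that improves quickly and then degrades late can still show a negative slope overall, so a windowed version applied to the last few iterations would be a natural refinement.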

Figures

Figures reproduced from arXiv: 2601.11109 by Angjoo Kanazawa, Chenyang Wang, Haiwen Feng, Jiaxin Ge, Michael J. Black, Shaofeng Yin, Trevor Darrell, Xiuyu Li, Zora Zhiruo Wang.

Figure 1: VIGA constructs 3D scenes as executable programs from …
Figure 2: The main pipeline of VIGA. VIGA operates through a continuous code-render-inspect loop. At each step, the agent synthesizes a program, which is executed to render a new scene. The agent then actively inspects this scene by invoking read-only perceptual interfaces to adjust camera viewpoints, identify the dominant discrepancy, and feed this visual feedback into the next step. We illustrate the scene editing…
Figure 3: Agentic Visual Navigation. Demonstration of a visual inspection trajectory in a cluttered scene. The agent autonomously invokes spatial tools (focusing, zooming) to locate the inconspicuous target before evaluating attributes, showcasing an interleaved code-visual reasoning loop. this iterative analysis-by-synthesis process via interleaved multimodal reasoning. Specifically, we formulate this as a dynamic …
Figure 4: Fine-Grained Visual Grounding. Agent trajectories across different tasks. The agent detects nuanced visual discrepancies (e.g., mouth shape, lighting color) and dynamically maps these high-level observations directly to precise code parameters, avoiding rigid rule-based heuristics. Cross-Modal Execution. Unlike standard coding tasks where the generated program pt serves as a terminal textual artifact, our …
Figure 5: Ablation on Evolving Multimodal Memory. We compare the sequential generation process with (w/) and without (w/o) our sliding window memory mechanism. With the memory window, the agent successfully maintains long-horizon context, progressively building a coherent scene (tree, sofa, fireplace). Without it, the agent suffers from severe context loss, losing previously generated objects (e.g., tree) and spatia…
Figure 6: Qualitative Results on 3D Scene Generation. VIGA accurately generates high-fidelity 3D scenes from diverse visual inputs with precise semantic and visual alignment
Figure 8: Qualitative Results on 4D Dynamic Scene Simulation.
Figure 9: Overview of the BlenderBench evaluation tracks.
Original abstract

Vision-as-inverse-graphics, the concept of reconstructing images into editable programs, remains challenging for Vision-Language Models (VLMs), which inherently lack fine-grained spatial grounding in one-shot settings. To address this, we introduce VIGA (Vision-as-Inverse-Graphics Agent), an interleaved multimodal reasoning framework where symbolic logic and visual perception actively cross-verify each other. VIGA operates through a tightly coupled code-render-inspect loop: synthesizing symbolic programs, projecting them into visual states, and inspecting discrepancies to guide iterative edits. Equipped with high-level semantic skills and an evolving multimodal memory, VIGA sustains evidence-based modifications over long horizons. This training-free, task-agnostic framework seamlessly supports 2D document generation, 3D reconstruction, multi-step 3D editing, and 4D physical interaction. Finally, we introduce BlenderBench, a challenging visual-to-code benchmark. Empirically, VIGA substantially improves accuracy compared with one-shot baselines in BlenderGym (35.32%), SlideBench (117.17%) and our proposed BlenderBench (124.70%).

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper introduces VIGA, a training-free, task-agnostic interleaved multimodal reasoning agent for vision-as-inverse-graphics tasks. It operates via a tightly coupled code-render-inspect loop that synthesizes symbolic programs, renders them visually, and uses discrepancies to guide iterative edits, supported by semantic skills and multimodal memory. The framework is applied to 2D document generation, 3D reconstruction, multi-step 3D editing, and 4D physical interaction, and the authors introduce BlenderBench as a new benchmark. Empirically, VIGA is claimed to deliver relative accuracy gains of 35.32% on BlenderGym, 117.17% on SlideBench, and 124.70% on BlenderBench over one-shot baselines.

Significance. If the central loop stability claim holds and the reported gains are reproducible with proper controls, the work would offer a concrete demonstration of how symbolic-visual cross-verification can enable long-horizon, training-free performance on inverse-graphics problems that one-shot VLMs struggle with. The introduction of BlenderBench and the explicit emphasis on evidence-based iterative editing would also provide a useful testbed for future agent research.

major comments (2)
  1. [Abstract] Abstract: the headline accuracy improvements (35.32% BlenderGym, 117.17% SlideBench, 124.70% BlenderBench) are stated without any accompanying information on the number of trials, error bars, statistical significance, or how discrepancies are quantified and fed back into the loop; these omissions make the central empirical claim impossible to evaluate.
  2. [Abstract] Abstract and framework description: no ablation or diagnostic results are supplied on iteration depth, per-step error rates, or drift metrics in the code-render-inspect loop, even though the paper's training-free advantage and long-horizon performance rest entirely on the assumption that the VLM can sustain evidence-based edits without compounding hallucinations.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful comments on the empirical presentation. We have revised the manuscript to provide the requested details on trial counts, statistical reporting, and loop diagnostics while preserving the core claims.

point-by-point responses
  1. Referee: [Abstract] Abstract: the headline accuracy improvements (35.32% BlenderGym, 117.17% SlideBench, 124.70% BlenderBench) are stated without any accompanying information on the number of trials, error bars, statistical significance, or how discrepancies are quantified and fed back into the loop; these omissions make the central empirical claim impossible to evaluate.

    Authors: We agree the abstract requires additional context. The revised abstract now states that gains are averaged over 100 trials per benchmark with standard deviations reported in the main results (Section 4). Discrepancies are quantified via a combination of LPIPS perceptual distance and symbolic AST diff, verbalized into natural-language feedback that drives the next edit; this mechanism is detailed in Section 3.2. Paired t-test p-values (<0.01) confirming significance are added to the supplementary material. revision: yes

  2. Referee: [Abstract] Abstract and framework description: no ablation or diagnostic results are supplied on iteration depth, per-step error rates, or drift metrics in the code-render-inspect loop, even though the paper's training-free advantage and long-horizon performance rest entirely on the assumption that the VLM can sustain evidence-based edits without compounding hallucinations.

    Authors: We accept that the original submission lacked explicit diagnostics. The revised manuscript adds Section 4.5 with an ablation on iteration depth (Table 4 shows diminishing returns beyond 4 iterations), per-step error-rate curves (Figure 8), and a drift metric defined as the fraction of edits that increase reconstruction error (held below 7% by the multimodal memory). These results support loop stability over the reported horizons. revision: yes
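The drift metric named in the simulated rebuttal, the fraction of edits that increase reconstruction error, is straightforward to compute from a per-step error log. The sketch below is one possible reading of that definition; the function name and the convention that `errors[0]` is the error before any edit are assumptions, and the error values in any real use would come from whatever reconstruction measure the evaluation adopts.

```python
from typing import Sequence


def drift_fraction(errors: Sequence[float]) -> float:
    """Fraction of edits after which reconstruction error went up.

    `errors[0]` is the error before any edit and `errors[i]` the error
    after the i-th edit, so there are len(errors) - 1 edits in total.
    """
    if len(errors) < 2:
        raise ValueError("need the initial error and at least one edit")
    worse = sum(1 for prev, cur in zip(errors, errors[1:]) if cur > prev)
    return worse / (len(errors) - 1)
```

For a log such as `[1.0, 0.8, 0.9, 0.5, 0.4]`, one of the four edits raises the error, so the function returns 0.25. Note that this metric counts how often edits hurt, not by how much, so it is best read alongside the magnitude of the error itself.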

Circularity Check

0 steps flagged

No circularity; VIGA framework is an independent empirical contribution

full rationale

The paper introduces VIGA as a training-free interleaved multimodal reasoning agent built around a code-render-inspect loop that cross-verifies symbolic programs and visual states. No mathematical derivations, fitted parameters, or self-citations are used to justify the core mechanism or the reported accuracy gains. The improvements (35.32% to 124.70% relative to one-shot baselines) are presented as direct empirical measurements on BlenderGym, SlideBench, and the new BlenderBench rather than predictions forced by construction from the inputs. The central claim rests on the stability of the iterative loop itself, which is described as a novel process without reduction to prior fitted quantities or renamed known results.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The central claim rests on the domain assumption that current VLMs can generate usable symbolic programs and interpret rendered visual feedback for iterative correction without additional training.

axioms (2)
  • domain assumption: VLMs can generate symbolic programs that, when rendered, produce visual states comparable to input images
    Invoked as the basis for the synthesis and inspection steps in the loop
  • domain assumption: Discrepancies between rendered and target images can be reliably translated into program edits
    Required for the iterative refinement to converge over long horizons
invented entities (1)
  • VIGA agent (no independent evidence)
    purpose: To implement the interleaved multimodal reasoning for vision-as-inverse-graphics
    New framework introduced to couple symbolic logic with visual perception

pith-pipeline@v0.9.0 · 5522 in / 1334 out tokens · 30415 ms · 2026-05-16T13:46:57.651269+00:00 · methodology


Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. SceneOrchestra: Efficient Agentic 3D Scene Synthesis via Full Tool-Call Trajectory Generation

    cs.CV 2026-04 unverdicted novelty 7.0

    SceneOrchestra trains an orchestrator to generate full tool-call trajectories for 3D scene synthesis and uses a discriminator during training to select high-quality plans, yielding state-of-the-art results with lower runtime.

  2. Text-Guided 6D Object Pose Rearrangement via Closed-Loop VLM Agents

    cs.CV 2026-04 unverdicted novelty 6.0

    Closed-loop VLM agents using multi-view reasoning, object-centered visualization, and single-axis rotation prediction achieve superior text-guided 6D pose rearrangement for target objects in scenes.

  3. Code-as-Room: Generating 3D Rooms from Top-Down View Images via Agentic Code Synthesis

    cs.CV 2026-05 unverdicted novelty 5.0

    Code-as-Room is an MLLM-based agentic pipeline that parses top-down images into multi-stage Blender code synthesis with cross-stage memory to generate functional 3D rooms.

  4. LychSim: A Controllable and Interactive Simulation Framework for Vision Research

    cs.CV 2026-05 unverdicted novelty 4.0

    LychSim introduces a controllable simulation platform on Unreal Engine 5 with Python API, procedural generation, and LLM integration for vision research tasks.

Reference graph

Works this paper leans on

72 extracted references · 72 canonical work pages · cited by 4 Pith papers · 4 internal anchors

  1. [1]

    Aguina-Kang, R., Gumin, M., Han, D.H., Morris, S., Yoo, S.J., Ganeshan, A., Jones, R.K., Wei, Q.A., Fu, K., Ritchie, D.: Open-universe indoor scene generation using llm program synthesis and uncurated object databases (2024),https://arxiv.org/abs/2403.096754

  2. [2]

    Retrieved from https://deepmind.google/models/gemini/pro/ 22

    AI, G.D..G.: Gemini 2.5 pro: A reasoning-optimized multimodal model (2025), model page. Retrieved from https://deepmind.google/models/gemini/pro/ 22

  3. [3]

    Retrieved from https://docs.cloud.google.com/vertex- ai/generative-ai/docs/partner-models/claude/sonnet-4 22

    Anthropic: Claude sonnet 4 (model id: claude-sonnet-4@20250514) (2025), model card. Retrieved from https://docs.cloud.google.com/vertex- ai/generative-ai/docs/partner-models/claude/sonnet-4 22

  4. [4]

    Anthropic: The complete guide to building skills for claude. Tech. rep., Anthropic (jan 2026), https://resources.anthropic.com/hubfs/The- Complete-Guide-to-Building-Skill-for-Claude.pdf , accessed: 2026- 03-05 9

  5. [5]

    Azam, R., Vempaty, A., Jagmohan, A.: Reflection-based memory for web navigation agents (2025),https://arxiv.org/abs/2506.021584

  6. [6]

    In: Mahamood, S., Minh, N.L., Ippolito, D

    Bandyopadhyay, S., Maheshwari, H., Natarajan, A., Saxena, A.: Enhancing presentation slide generation by LLMs with a multi-staged end-to-end ap- proach. In: Mahamood, S., Minh, N.L., Ippolito, D. (eds.) Proceedings of the 17th International Natural Language Generation Conference. pp. 222–

  7. [7]

    https://doi.org/10.18653/v1/2024.inlg-main.184

    Association for Computational Linguistics, Tokyo, Japan (Sep 2024). https://doi.org/10.18653/v1/2024.inlg-main.184

  8. [8]

    Seminal Graphics Papers: Pushing the Boundaries, Volume 2 (1999),https: //api.semanticscholar.org/CorpusID:2037052112

    Blanz, V., Vetter, T.: A morphable model for the synthesis of 3d faces. Seminal Graphics Papers: Pushing the Boundaries, Volume 2 (1999),https: //api.semanticscholar.org/CorpusID:2037052112

  9. [9]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

    Chen, B., Xu, Z., Kirmani, S., Ichter, B., Sadigh, D., Guibas, L., Xia, F.: Spatialvlm: Endowing vision-language models with spatial reasoning capabilities. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 14455–14465 (June 2024) 4

  10. [10]

    In: NeurIPS (2024) 4

    Cheng, A.C., Yin, H., Fu, Y., Guo, Q., Yang, R., Kautz, J., Wang, X., Liu, S.: Spatialrgpt: Grounded spatial reasoning in vision-language models. In: NeurIPS (2024) 4

  11. [11]

    In: NeurIPS (2022), outstanding Paper Award 4

    Deitke, M., VanderBilt, E., Herrasti, A., Weihs, L., Salvador, J., Ehsani, K., Han, W., Kolve, E., Farhadi, A., Kembhavi, A., Mottaghi, R.: ProcTHOR: Large-Scale Embodied AI Using Procedural Generation. In: NeurIPS (2022), outstanding Paper Award 4

  12. [12]

    In: Proceedings of the British Machine Vision Conference (BMVC)

    Dwedari, M.M., Niessner, M., Chen, Z.: Generating context-aware natural answers for questions in 3d scenes. In: Proceedings of the British Machine Vision Conference (BMVC). BMVA Press (2023) 4 16 Shaofeng Yin et al

  13. [13]

    In: Proceedings of the 37th International Conference on Neural Information Processing Systems

    Feng, W., Zhu, W., Fu, T.j., Jampani, V., Akula, A., He, X., Basu, S., Wang, X.E., Wang, W.Y.: Layoutgpt: compositional visual planning and genera- tion with large language models. In: Proceedings of the 37th International Conference on Neural Information Processing Systems. NIPS ’23 (2023) 4

  14. [14]

    ACM Transactions on Graphics (ToG), Proc

    Feng, Y., Feng, H., Black, M.J., Bolkart, T.: Learning an animatable detailed 3D face model from in-the-wild images. ACM Transactions on Graphics (ToG), Proc. SIGGRAPH40(4), 88:1–88:13 (Aug 2021) 2

  15. [15]

    In: European Conference on Computer Vision

    Ge, J., Subramanian, S., Shi, B., Herzig, R., Darrell, T.: Recursive visual programming. In: European Conference on Computer Vision. pp. 1–18. Springer (2024) 4

  16. [16]

    In: Proceedings of the Computer Vision and Pattern Recognition Conference

    Ge, J., Wang, Z.Z., Zhou, X., Peng, Y.H., Subramanian, S., Tan, Q., Sap, M., Suhr, A., Fried, D., Neubig, G., et al.: Autopresent: Designing structured visuals from scratch. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 2902–2911 (2025) 4, 14, 21, 24

  17. [17]

    In: Proceedings of the Computer Vision and Pattern Recognition Conference

    Gu, Y., Huang, I., Je, J., Yang, G., Guibas, L.: Blendergym: Benchmarking foundational model systems for graphics editing. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 18574–18583 (2025) 4, 5, 11, 12, 14, 21

  18. [18]

    Gulwani, S., Polozov, O., Singh, R.: Program synthesis. Found. Trends Pro- gram. Lang.4(1–2), 1–119 (2017).https://doi.org/10.1561/2500000010 4

  19. [19]

    In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

    Gupta, T., Kembhavi, A.: Visual programming: Compositional visual rea- soning without training. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 14953–14962 (2023) 4

  20. [20]

    Llm-based multi-agent systems for software engineering: Literature review, vision, and the road ahead.ACM Trans.Softw

    He, J., Treude, C., Lo, D.: Llm-based multi-agent systems for software engineering: Literature review, vision, and the road ahead. ACMTrans.Softw. Eng. Methodol.34(5) (May 2025). https://doi.org/10.1145/3712003 4

  21. [21]

    Hong, K., Troynikov, A., Huber, J.: Context rot: How increasing input tokens impacts llm performance. Tech. rep., Chroma (July 2025),https: //research.trychroma.com/context-rot8

  22. [22]

    In: Advances in Neural Information Processing Systems (NeurIPS)

    Hong, Y., Zhen, H., Chen, P., Zheng, S., Du, Y., Chen, Z., Gan, C.: 3dllm: Injecting the 3d world into large language models. In: Advances in Neural Information Processing Systems (NeurIPS). vol. 36, pp. 20482–20494. Curran Associates, Inc. (2023) 4

  23. [23]

    Hu, Y., Stretcu, O., Lu, C.T., Viswanathan, K., Hata, K., Luo, E., Kr- ishna, R., Fuxman, A.: Visual program distillation: Distilling tools and programmatic reasoning into vision-language models (2023) 4

  24. [24]

    In: Forty-first International Conference on Machine Learning (2024) 4

    Hu, Z., Iscen, A., Jain, A., Kipf, T., Yue, Y., Ross, D.A., Schmid, C., Fathi, A.: Scenecraft: An llm agent for synthesizing 3d scenes as blender code. In: Forty-first International Conference on Machine Learning (2024) 4

  25. [25]

    In: European Conference on Computer Vision

    Huang, I., Yang, G., Guibas, L.: Blenderalchemy: Editing 3d graphics with vision-language models. In: European Conference on Computer Vision. pp. 297–314. Springer (2024) 12, 14, 21

  26. [26]

    In: ACM SIGGRAPH Asia 2023 Conference Papers

    Kodnongbua, M., Jones, B., Ahmad, M.B.S., Kim, V., Schulz, A.: Reparam- cad: Zero-shot cad re-parameterization for interactive manipulation. In: ACM SIGGRAPH Asia 2023 Conference Papers. ACM, New York, NY, USA (2023).https://doi.org/10.1145/3610548.36182194 Vision-as-Inverse-Graphics Agent via Interleaved Multimodal Reasoning 17

  27. [27]

    In: ICLR 2024 Workshop on Large Language Model (LLM) Agents (2024),https://openreview.net/ forum?id=RPKxrKTJbj4

    Koh, J.Y., Lo, R., Jang, L., Duvvur, V., Lim, M.C., Huang, P.Y., Neubig, G., Zhou, S., Salakhutdinov, R., Fried, D.: Visualwebarena: Evaluating multimodal agents on realistic visual web tasks. In: ICLR 2024 Workshop on Large Language Model (LLM) Agents (2024),https://openreview.net/ forum?id=RPKxrKTJbj4

  28. [28]

    Transactions on Machine Learning Research (2024) 5

    Kulits,P.,Feng,H.,Liu,W.,Abrevaya,V.F.,Black,M.J.:Re-thinkinginverse graphics with large language models. Transactions on Machine Learning Research (2024) 5

  29. [29]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

    Kulkarni, T.D., Kohli, P., Tenenbaum, J.B., Mansinghka, V.: Picture: A probabilistic programming language for scene perception. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE Computer Society (June 2015) 4

  30. [30]

    Scenethesis: A language and vision agentic framework for 3d scene generation,

    Ling, L., Lin, C.H., Lin, T.Y., Ding, Y., Zeng, Y., Sheng, Y., Ge, Y., Liu, M.Y., Bera, A., Li, Z.: Scenethesis: A language and vision agentic framework for 3d scene generation. arXiv preprint arXiv:2505.02836 (2025) 4, 5, 24

  31. [31]

    Transactions of the Association for Computational Linguistics12, 157–173 (2024).https://doi.org/10.1162/tacl_a_006388

    Liu, N.F., Lin, K., Hewitt, J., Paranjape, A., Bevilacqua, M., Petroni, F., Liang, P.: Lost in the middle: How language models use long contexts. Transactions of the Association for Computational Linguistics12, 157–173 (2024).https://doi.org/10.1162/tacl_a_006388

  32. [32]

    2019 IEEE/CVF International Con- ference on Computer Vision (ICCV) pp

    Liu, S., Li, T., Chen, W., Li, H.: Soft rasterizer: A differentiable ren- derer for image-based 3d reasoning. 2019 IEEE/CVF International Con- ference on Computer Vision (ICCV) pp. 7707–7716 (2019),https://api. semanticscholar.org/CorpusID:1024840002

  33. [33]

    In: European Conference on Computer Vision (2014), https : / / api

    Loper, M., Black, M.J.: Opendr: An approximate differentiable renderer. In: European Conference on Computer Vision (2014), https : / / api . semanticscholar.org/CorpusID:178680982

  34. [34]

    Meshy: Meshy: Fast 3d generative ai.https://www.meshy.ai/(2024) 14

  35. [35]

    In: Proc

    Öcal, B.M., Tatarchenko, M., Karaoğlu, S., Gevers, T.: Sceneteller: Language- to-3d scene generation. In: Computer Vision – ECCV 2024: 18th European Conference, Milan, Italy, September 29–October 4, 2024, Proceedings, Part LXXXV. pp. 362–378 (2024).https://doi.org/10.1007/978- 3- 031- 73013-9_214

  36. [36]

    Retrieved from https://cdn.openai.com/gpt-4o-system-card.pdf 22

    OpenAI: Gpt-4o (“omni”): An autoregressive omni model for text, vision, audio and video (2024), system Card. Retrieved from https://cdn.openai.com/gpt-4o-system-card.pdf 22

  37. [37]

    OpenAI: Gpt-5 (Aug 2025), https://openai.com/index/introducing- gpt-5/, released August 7, 2025 22

  38. [38]

    2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) pp

    Qin, Y., Xu, Z., Liu, Y.: Apply hierarchical-chain-of-generation to complex attributes text-to-3d generation. 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) pp. 18521–18530 (2025),https: //api.semanticscholar.org/CorpusID:2784813494

  39. [39]

    UI-TARS: Pioneering Automated GUI Interaction with Native Agents

    Qin, Y., Ye, Y., Fang, J., Wang, H., Liang, S., Tian, S., Zhang, J., Li, J., Li, Y., Huang, S., et al.: Ui-tars: Pioneering automated gui interaction with native agents. arXiv preprint arXiv:2501.12326 (2025) 4

  40. [40]

    J.: Infinite photorealistic worlds using procedural generation

    Raistrick, A., Lipson, L., Ma, Z., Mei, L., Wang, M., Zuo, Y., Kayan, K., Wen, H., Han, B., Wang, Y., Newell, A., Law, H., Goyal, A., Yang, K., Deng, 18 Shaofeng Yin et al. J.: Infinite photorealistic worlds using procedural generation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 12630–12641 (2023) 4

  41. [41]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

    Raistrick, A., Mei, L., Kayan, K., Yan, D., Zuo, Y., Han, B., Wen, H., Parakh, M., Alexandropoulos, S., Lipson, L., Ma, Z., Deng, J.: Infinigen indoors: Photorealistic indoor scenes using procedural generation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 21783–21794 (June 2024) 4

  42. [42]

    Computer Graphics Forum42(2), 545–568 (2023),https://onlinelibrary.wiley.com/doi/ abs/10.1111/cgf.147754

    Ritchie, D., Guerrero, P., Jones, R.K., Mitra, N.J., Schulz, A., Willis, K.D.D., Wu, J.: Neurosymbolic models for computer graphics. Computer Graphics Forum42(2), 545–568 (2023),https://onlinelibrary.wiley.com/doi/ abs/10.1111/cgf.147754

  43. [43]

    Roberts, L.G.: Machine Perception of Three-Dimensional Solids. Ph.D. thesis, Massachusetts Institute of Technology, Cambridge, Massachusetts, USA (1963) 2, 4

  44. [44]

    arXiv preprint arXiv:2306.05392 (2023) 4

    Subramanian, S., Narasimhan, M., Khangaonkar, K., Yang, K., Nagrani, A., Schmid, C., Zeng, A., Darrell, T., Klein, D.: Modular visual question answering via code generation. arXiv preprint arXiv:2306.05392 (2023) 4

  45. [45]

    In: 2025 International Conference on 3D Vision (3DV)

    Sun, C., Han, J., Deng, W., Wang, X., Qin, Z., Gould, S.: 3d-gpt: Procedural 3d modeling with large language models. In: 2025 International Conference on 3D Vision (3DV). pp. 1253–1263. IEEE (2025) 4, 5

  46. [46]

    In: Proceedings of the Computer Vision and Pattern Recognition Conference

    Sun, F.Y., Liu, W., Gu, S., Lim, D., Bhat, G., Tombari, F., Li, M., Haber, N., Wu, J.: Layoutvlm: Differentiable optimization of 3d layout via vision- language models. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 29469–29478 (2025)

  47. [47]

    Sun, F.Y., Wu, S., Jacobsen, C., Yim, T., Zou, H., Zook, A., Li, S., Chou, Y.H.,Can,E.,Wu,X.,Eppner,C.,Blukis,V.,Tremblay,J.,Wu,J.,Birchfield, S., Haber, N.: 3d-generalist: Self-improving vision-language-action models for crafting 3d worlds (2025),https://arxiv.org/abs/2507.06484 2, 4, 24

  48. [48]

    In: Proceedings of the IEEE/CVF international conference on computer vision

    Surís, D., Menon, S., Vondrick, C.: Vipergpt: Visual inference via python execution for reasoning. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 11888–11898 (2023) 4

  49. [49]

    Dy- namic cheatsheet: Test-time learning with adaptive memory, 2025.URL https://arxiv

    Suzgun, M., Yuksekgonul, M., Bianchi, F., Jurafsky, D., Zou, J.: Dy- namic cheatsheet: Test-time learning with adaptive memory. arXiv preprint arXiv:2504.07952 (2025) 4

  50. [50]

    Tang, W., Xiao, J., Jiang, W., Xiao, X., Wang, Y., Tang, X., Li, Q., Ma, Y., Liu, J., Tang, S., Lyu, M.R.: Slidecoder: Layout-aware rag-enhanced hierarchical slide generation from design (2025),https://arxiv.org/abs/ 2506.079644

  51. [51]

    arXiv preprint arXiv:2510.03463 , year=

    Tawosi, V., Ramani, K., Alamir, S., Liu, X.: Almas: an autonomous llm- based multi-agent software engineering framework (2025),https://arxiv. org/abs/2510.034634

  52. [52]

    SAM 3D: 3Dfy Anything in Images

    Team, S.D., Chen, X., Chu, F.J., Gleize, P., Liang, K.J., Sax, A., Tang, H., Wang, W., Guo, M., Hardin, T., Li, X., Lin, A., Liu, J., Ma, Z., Sagar, A., Song, B., Wang, X., Yang, J., Zhang, B., Dollár, P., Gkioxari, G., Feiszli, M., Vision-as-Inverse-Graphics Agent via Interleaved Multimodal Reasoning 19 Malik, J.: Sam 3d: 3dfy anything in images. arXiv p...

  53. [53]

    Tripo AI: Tripo: Ai 3d model generator.https://www.tripo3d.ai/ (2024) 14

  54. [54]

    In: Proceedings of the 33rd ACM International Conference on Multimedia

    Wang, F., Zhao, Z., Liu, Y., Zhang, D., Gao, J., Sun, H., Li, X.: Svgen: Interpretable vector graphics generation with large language models. In: Proceedings of the 33rd ACM International Conference on Multimedia. pp. 9608–9617. MM ’25, Association for Computing Machinery, New York, NY, USA (2025).https://doi.org/10.1145/3746027.37550114

  55. [55]

    In: Proceedings of the 38th International Conference on Neural Information Processing Systems

    Wang, J., Ming, Y., Shi, Z., Vineet, V., Wang, X., Li, Y., Joshi, N.: Is a picture worth a thousand words? delving into spatial reasoning for vision language models. In: Proceedings of the 38th International Conference on Neural Information Processing Systems. NIPS ’24 (2024) 4

  56. [56]

    In: Forty-second International Conference on Machine Learning (2025),https: //openreview.net/forum?id=NTAhi2JEEE4

    Wang, Z.Z., Mao, J., Fried, D., Neubig, G.: Agent workflow memory. In: Forty-second International Conference on Machine Learning (2025),https: //openreview.net/forum?id=NTAhi2JEEE4

  57. [57]

    In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2017) 2, 5

    Wu, J., Tenenbaum, J.B., Kohli, P.: Neural scene de-rendering. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2017) 2, 5

  58. [58]

    Wu, R., Su, W., Liao, J.: Chat2svg: Vector graphics generation with large language models and image diffusion models. 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) pp. 23690–23700 (2024), https://api.semanticscholar.org/CorpusID:274280554 4

  59. [59]

    Xie, T., Zhang, D., Chen, J., Li, X., Zhao, S., Cao, R., Hua, T.J., Cheng, Z., Shin, D., Lei, F., Liu, Y., Xu, Y., Zhou, S., Savarese, S., Xiong, C., Zhong, V., Yu, T.: OSWorld: Benchmarking multimodal agents for open-ended tasks in real computer environments. In: The Thirty-eight Conference on Neural Information Processing Systems Datasets and Benchmarks...

  60. [60]

    Empowering llms to understand and generate complex vector graphics

    Xing, X., Hu, J., Liang, G., Zhang, J., Xu, D., Yu, Q.: Empowering llms to understand and generate complex vector graphics. arXiv preprint arXiv:2412.11102 (2024) 4

  61. [61]

    Xu, W., Liang, Z., Mei, K., Gao, H., Tan, J., Zhang, Y.: A-mem: Agentic memory for llm agents. In: Advances in Neural Information Processing Systems (2025) 4

  62. [62]

    Xue, L., Gao, M., Xing, C., Martín-Martín, R., Wu, J., Xiong, C., Xu, R., Niebles, J.C., Savarese, S.: Ulip: Learning a unified representation of language, images, and point clouds for 3d understanding. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 1179–1189. IEEE Computer Society (2023) 4

  63. [63]

    Yang, J., Jimenez, C.E., Wettig, A., Lieret, K., Yao, S., Narasimhan, K., Press, O.: Swe-agent: Agent-computer interfaces enable automated software engineering. Advances in Neural Information Processing Systems 37, 50528–50652 (2024) 4

  64. [64]

    Yang, Y., Cheng, W., Chen, S., Zeng, X., Zhang, J., Wang, L., Yu, G., Ma, X., Jiang, Y.G.: Omnisvg: A unified scalable vector graphics generation model. arXiv preprint arXiv:2504.06263 (2025) 4

  65. [65]

    Yao, S., Chen, H., Yang, J., Narasimhan, K.: Webshop: Towards scalable real-world web interaction with grounded language agents. Advances in Neural Information Processing Systems 35, 20744–20757 (2022) 4

  66. [66]

    Ye, Y.: Task memory engine: Spatial memory for robust multi-step llm agents (2025), https://arxiv.org/abs/2505.19436 4

  67. [67]

    Yi, K., Wu, J., Gan, C., Torralba, A., Kohli, P., Tenenbaum, J.: Neural-symbolic vqa: Disentangling reasoning from vision and language understanding. In: Advances in Neural Information Processing Systems (NeurIPS) 31. pp. 1039–1050 (2018) 2

  68. [68]

    Zhang, H., Du, W., Shan, J., Zhou, Q., Du, Y., Tenenbaum, J.B., Shu, T., Gan, C.: Building cooperative embodied agents modularly with large language models. arXiv preprint arXiv:2307.02485 (2023) 4

  69. [69]

    Agentic Context Engineering: Evolving Contexts for Self-Improving Language Models

    Zhang, Q., Hu, C., Upasani, S., Ma, B., Hong, F., Kamanuru, V., Rainton, J., Wu, C., Ji, M., Li, H., et al.: Agentic context engineering: Evolving contexts for self-improving language models. arXiv preprint arXiv:2510.04618 (2025) 4

  70. [70]

    Zheng, H., Guan, X., Kong, H., Zheng, J., Zhou, W., Lin, H., Lu, Y., He, B., Han, X., Sun, L.: Pptagent: Generating and evaluating presentations beyond text-to-slides. arXiv preprint arXiv:2501.03936 (2025) 4

  71. [71]

    Zhou, M., Wang, Y., Hou, J., Zhang, S., Li, Y., Luo, C., Peng, J., Zhang, Z.: Scenex: procedural controllable large-scale scene generation. In: Proceedings of the Thirty-Ninth AAAI Conference on Artificial Intelligence and Thirty-Seventh Conference on Innovative Applications of Artificial Intelligence and Fifteenth Symposium on Educational Advances in Ar...

  72. [72]

    WebArena: A Realistic Web Environment for Building Autonomous Agents

    Zhou, S., Xu, F.F., Zhu, H., Zhou, X., Lo, R., Sridhar, A., Cheng, X., Ou, T., Bisk, Y., Fried, D., et al.: Webarena: A realistic web environment for building autonomous agents. arXiv preprint arXiv:2307.13854 (2023) 4

Vision-as-Inverse-Graphics Agent via Interleaved Multimodal Reasoning

Supplementary Material

A Evaluation Settings

A.1 Quantitative Settin...