pith. machine review for the scientific record. sign in

arxiv: 2604.09167 · v1 · submitted 2026-04-10 · 💻 cs.CV · cs.MA

Recognition: 2 theorem links

· Lean Theorem

MAG-3D: Multi-Agent Grounded Reasoning for 3D Understanding

Authors on Pith no claims yet

Pith reviewed 2026-05-10 17:20 UTC · model grok-4.3

classification 💻 cs.CV cs.MA
keywords multi-agent systems3D scene understandinggrounded reasoningvision-language modelstraining-free methodsgeometric verificationspatial reasoningzero-shot generalization
0
0 comments X

The pith

A multi-agent framework lets off-the-shelf vision-language models perform grounded reasoning in 3D scenes without training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that dividing 3D reasoning work among specialized agents lets general-purpose vision-language models handle open-ended spatial questions in complex scenes. A planning agent breaks tasks into steps and directs the flow, a grounding agent locates relevant objects and pulls the right scene views from many observations, and a coding agent writes and runs programs to verify geometric relations. This replaces both task-specific fine-tuning and rigid hand-built pipelines with dynamic collaboration. If the claim holds, 3D understanding becomes more adaptable to new environments using only existing models.

Core claim

MAG-3D is a training-free multi-agent framework that coordinates a planning agent to decompose queries and orchestrate reasoning, a grounding agent to execute free-form 3D object identification and relevant frame retrieval from extensive scene observations, and a coding agent to conduct flexible geometric reasoning and explicit verification through executable programs, thereby enabling effective grounded reasoning across diverse 3D scenes with off-the-shelf VLMs.

What carries the argument

The multi-agent collaboration architecture with a planning agent, grounding agent, and coding agent that dynamically coordinates off-the-shelf VLMs for task decomposition, free-form scene grounding, and programmatic geometric verification.

If this is right

  • The system applies to novel 3D scenes in a zero-shot manner without retraining.
  • Open-ended queries about spatial and geometric relationships become feasible without fixed procedures.
  • Explicit code-based verification reduces errors in geometric checks compared to text-only reasoning.
  • Performance reaches state-of-the-art levels on challenging 3D reasoning benchmarks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar agent divisions could apply to related domains such as video or embodied scene reasoning.
  • Lower dependence on 3D-specific training data may ease deployment in resource-limited settings.
  • The modular design could support integration with robotic perception pipelines for real-time spatial tasks.

Load-bearing premise

Off-the-shelf vision-language models can reliably carry out accurate 3D grounding and geometric verification when guided by the planning, grounding, and coding agents without any task-specific training.

What would settle it

A 3D scene and natural-language query on which the multi-agent system returns incorrect object grounding or an erroneous spatial verification that a human observer can directly confirm as wrong.

Figures

Figures reproduced from arXiv: 2604.09167 by Chenyue Fang, Gao Huang, Henry Zheng, Rui Huang, Siyuan Wei, Xiao Liu.

Figure 1
Figure 1. Figure 1: Comparison between existing methods and MAG-3D. [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: The overall framework of MAG-3D. Given a question q and RGB observa￾tions I, the planning agent dynamically orchestrates expert agents for spatial ground￾ing, geometric reasoning, and retrieves information from scene memory M as needed. The framework maintains explicit, inspectable intermediate results (e.g., grounded in￾stances and geometric results), which are then aggregated and summarized into the fina… view at source ↗
Figure 3
Figure 3. Figure 3: Grounding-QA coherence on Beacon3D. Comparison of QA Score ≥ 75, Good Coherence, and R2 across state-of-the-art methods. We also compare against SceneCOT [28], which is trained on specially crafted data for 3D grounded reasoning, and observe that MAG-3D still achieves higher overall performance without in-domain training. In particular, relative to the previous best method, SceneCOT, in [PITH_FULL_IMAGE:f… view at source ↗
Figure 4
Figure 4. Figure 4: Qualitative results on Beacon3D. For each query, we show (top) the input question and RGB sequence, (middle) intermediate visual and geometric outputs, and (bottom) the predicted and ground-truth answers. 4.5 Qualitative Results [PITH_FULL_IMAGE:figures/full_fig_p014_4.png] view at source ↗
read the original abstract

Vision-language models (VLMs) have achieved strong performance in multimodal understanding and reasoning, yet grounded reasoning in 3D scenes remains underexplored. Effective 3D reasoning hinges on accurate grounding: to answer open-ended queries, a model must first identify query-relevant objects and regions in a complex scene, and then reason about their spatial and geometric relationships. Recent approaches have demonstrated strong potential for grounded 3D reasoning. However, they often rely on in-domain tuning or hand-crafted reasoning pipelines, which limit their flexibility and zero-shot generalization to novel environments. In this work, we present MAG-3D, a training-free multi-agent framework for grounded 3D reasoning with off-the-shelf VLMs. Instead of relying on task-specific training or fixed reasoning procedures, MAG-3D dynamically coordinates expert agents to address the key challenges of 3D reasoning. Specifically, we propose a planning agent that decomposes the task and orchestrates the overall reasoning process, a grounding agent that performs free-form 3D grounding and relevant frame retrieval from extensive 3D scene observations, and a coding agent that conducts flexible geometric reasoning and explicit verification through executable programs. This multi-agent collaborative design enables flexible training-free 3D grounded reasoning across diverse scenes and achieves state-of-the-art performance on challenging benchmarks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 3 minor

Summary. The paper introduces MAG-3D, a training-free multi-agent framework for grounded 3D reasoning that coordinates three off-the-shelf VLMs: a planning agent that decomposes queries and orchestrates the process, a grounding agent that performs free-form object and region identification plus frame retrieval from 3D scene observations, and a coding agent that translates geometric queries into executable programs for explicit spatial verification. The central claim is that this collaborative design enables flexible zero-shot 3D reasoning across diverse scenes and attains state-of-the-art results on challenging benchmarks without task-specific training or hand-crafted pipelines.

Significance. If the experimental results hold, the work would be significant for demonstrating that dynamic multi-agent coordination of existing VLMs can address core 3D grounding and verification challenges more flexibly than prior tuned or pipeline-based methods. The training-free nature and explicit code-based verification step represent a potentially generalizable direction for open-ended 3D understanding.

major comments (3)
  1. [§4.2] §4.2 (Coding Agent): The central claim that the coding agent produces reliable executable programs for geometric verification is load-bearing, yet the manuscript provides no quantitative metrics on code-generation success rate, execution error frequency, or failure modes (e.g., incorrect coordinate transforms or distance computations). Without these, it is impossible to verify whether the “explicit verification” step actually contributes to the reported SOTA gains or merely passes through VLM hallucinations.
  2. [Table 3, §5.1] Table 3 and §5.1: The SOTA performance claims are presented without ablation studies that isolate the contribution of the coding agent (e.g., replacing it with direct VLM reasoning) or the multi-agent orchestration. The performance delta over single-VLM baselines is therefore not attributable to the proposed design, weakening the causal link between the framework and the benchmark results.
  3. [§3.3] §3.3 (Grounding Agent): The description of free-form 3D grounding and relevant-frame retrieval lacks detail on how the agent handles large-scale point clouds or meshes; no pseudocode, prompt templates, or failure-case analysis is supplied, making reproducibility and robustness assessment difficult despite the training-free claim.
minor comments (3)
  1. [Abstract, §1] The abstract and §1 repeatedly use “state-of-the-art” without immediately citing the specific benchmarks and prior numbers; a parenthetical reference to the main results table would improve clarity.
  2. [Figure 2] Figure 2 (agent interaction diagram) uses overlapping arrows that obscure the exact data flow between the grounding and coding agents; redrawing with clearer sequencing would aid comprehension.
  3. [§3.1, §4.1] Notation for 3D coordinates and bounding boxes is introduced inconsistently between §3.1 and §4.1; a single unified definition table would eliminate ambiguity.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback, which highlights important areas for strengthening the empirical support and reproducibility of MAG-3D. We address each major comment below and will incorporate the suggested additions in the revised manuscript.

read point-by-point responses
  1. Referee: [§4.2] §4.2 (Coding Agent): The central claim that the coding agent produces reliable executable programs for geometric verification is load-bearing, yet the manuscript provides no quantitative metrics on code-generation success rate, execution error frequency, or failure modes (e.g., incorrect coordinate transforms or distance computations). Without these, it is impossible to verify whether the “explicit verification” step actually contributes to the reported SOTA gains or merely passes through VLM hallucinations.

    Authors: We agree that direct quantitative evaluation of the coding agent would provide stronger evidence for its contribution. Although the end-to-end SOTA results on multiple benchmarks offer indirect support for the verification step's utility, we will add a dedicated analysis in the revision. This will include code-generation success rates, execution error frequencies, and categorized failure modes (such as coordinate transform errors) computed over the evaluation datasets, presented in a new table or subsection under §4.2. revision: yes

  2. Referee: [Table 3, §5.1] Table 3 and §5.1: The SOTA performance claims are presented without ablation studies that isolate the contribution of the coding agent (e.g., replacing it with direct VLM reasoning) or the multi-agent orchestration. The performance delta over single-VLM baselines is therefore not attributable to the proposed design, weakening the causal link between the framework and the benchmark results.

    Authors: We acknowledge that the current comparisons to single-VLM baselines in Table 3 do not fully isolate the individual components. To address this, we will perform and report additional ablation experiments in the revised §5.1 and Table 3 (or an extended table). These will include variants that replace the coding agent with direct VLM-based geometric reasoning and variants that use a single unified agent instead of the multi-agent orchestration, allowing clearer attribution of performance gains to the proposed design. revision: yes

  3. Referee: [§3.3] §3.3 (Grounding Agent): The description of free-form 3D grounding and relevant-frame retrieval lacks detail on how the agent handles large-scale point clouds or meshes; no pseudocode, prompt templates, or failure-case analysis is supplied, making reproducibility and robustness assessment difficult despite the training-free claim.

    Authors: We agree that expanded implementation details are needed for reproducibility. In the revised manuscript, we will augment §3.3 with pseudocode outlining the grounding and frame-retrieval procedure, example prompt templates provided to the off-the-shelf VLM, and a short discussion of observed failure cases when processing large-scale point clouds or meshes. These additions will clarify the training-free mechanism without altering the core claims. revision: yes

Circularity Check

0 steps flagged

No circularity; empirical framework using external VLMs with no self-referential derivations

full rationale

The paper describes a training-free multi-agent system (planning, grounding, and coding agents) built on off-the-shelf VLMs. No equations, fitted parameters, or predictions appear in the provided text. Claims of SOTA performance rest on benchmark evaluation rather than any reduction of outputs to self-defined inputs or self-citation chains. The central design is presented as a novel coordination method whose validity is externally testable, satisfying the criteria for a self-contained, non-circular contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 3 invented entities

Only the abstract is available, so the ledger reflects high-level design assumptions rather than detailed derivations. The framework assumes existing VLMs possess sufficient grounding and coding capabilities when properly orchestrated.

axioms (1)
  • domain assumption Off-the-shelf VLMs can perform free-form 3D grounding and executable geometric reasoning when coordinated by specialized agents
    This is the core premise enabling the training-free claim and is invoked throughout the abstract description of the agents.
invented entities (3)
  • Planning agent no independent evidence
    purpose: Decompose tasks and orchestrate reasoning process
    Newly proposed component in the multi-agent design; no independent evidence provided beyond the abstract claim.
  • Grounding agent no independent evidence
    purpose: Perform 3D grounding and frame retrieval
    Newly proposed component; no independent evidence beyond abstract.
  • Coding agent no independent evidence
    purpose: Conduct geometric reasoning via executable programs
    Newly proposed component; no independent evidence beyond abstract.

pith-pipeline@v0.9.0 · 5544 in / 1312 out tokens · 51541 ms · 2026-05-10T17:20:22.599313+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

57 extracted references · 32 canonical work pages · 7 internal anchors

  1. [1]

    Anthropic: Claude 3.5 sonnet (Jun 2024),https://www.anthropic.com/news/ claude-3-5-sonnet

  2. [2]

    arXiv (2025),https://ai.meta.com/ research/publications/locate- 3d- real- world- object- localization- via- self-supervised-learning-in-3d

    Arnaud*, S., McVay*, P., Ada Martin*, A.M., Jatavallabhula, K.M., Thomas, P., Partsey, R., Dugas, D., Gejji, A., Sax, A., Berges, V.P., Henaff, M., Jain, A., Cao, A., Prasad, I., Kalakrishnan, M., Rabbat, M., Ballas, N., Assran, M., Maksymets, O., Rajeswaran, A., Meier, F.: Locate 3d: Real-world object local- ization via self-supervised learning in 3d. ar...

  3. [3]

    ByteDance Seed Team: Seed1.6 (Jun 2025),https://seed.bytedance.com/en/ seed1_6

  4. [4]

    Scaling spatial intelligence with multi- modal foundation models.arXiv preprint arXiv:2511.13719,

    Cai, Z., Wang, R., Gu, C., Pu, F., Xu, J., Wang, Y., Yin, W., Yang, Z., Wei, C., Sun, Q., Zhou, T., Li, J., Pang, H.E., Qian, O., Wei, Y., Lin, Z., Shi, X., Deng, K., Han, X., Chen, Z., Fan, X., Deng, H., Lu, L., Pan, L., Li, B., Liu, Z., Wang, Q., Lin, D., Yang, L.: Scaling spatial intelligence with multimodal foundation models. arXiv preprint arXiv:2511...

  5. [5]

    SAM 3: Segment Anything with Concepts

    Carion, N., Gustafson, L., Hu, Y.T., Debnath, S., Hu, R., Suris, D., Ryali, C., Alwala,K.V.,Khedr,H.,Huang,A.,etal.:Sam3:Segmentanythingwithconcepts. arXiv preprint arXiv:2511.16719 (2025)

  6. [6]

    In: CVPR

    Chen, B., Xu, Z., Kirmani, S., Ichter, B., Sadigh, D., Guibas, L., Xia, F.: Spa- tialvlm: Endowing vision-language models with spatial reasoning capabilities. In: CVPR. pp. 14455–14465 (2024)

  7. [7]

    In: CVPR

    Chen, S., Chen, X., Zhang, C., Li, M., Yu, G., Fei, H., Zhu, H., Fan, J., Chen, T.: Ll3da: Visual interactive instruction tuning for omni-3d understanding reasoning and planning. In: CVPR. pp. 26428–26438 (2024)

  8. [8]

    Geometrically-constrained agent for spatial reasoning.arXiv preprint arXiv:2511.22659, 2025

    Chen, Z., Lu, X., Zheng, Z., Li, P., He, L., Zhou, Y., Shao, J., Zhuang, B., Sheng, L.: Geometrically-constrained agent for spatial reasoning. arXiv preprint arXiv:2511.22659 (2025)

  9. [9]

    Think with 3d: Geometric imagina- tion grounded spatial reasoning from limited views.arXiv preprint arXiv:2510.18632, 2025

    Chen, Z., Zhang, M., Yu, X., Luo, X., Sun, M., Pan, Z., Feng, Y., Pei, P., Cai, X., Huang, R.: Think with 3d: Geometric imagination grounded spatial reasoning from limited views. arXiv preprint arXiv:2510.18632 (2025)

  10. [10]

    NeurIPS37, 135062–135093 (2024)

    Cheng, A.C., Yin, H., Fu, Y., Guo, Q., Yang, R., Kautz, J., Wang, X., Liu, S.: Spatialrgpt: Grounded spatial reasoning in vision-language models. NeurIPS37, 135062–135093 (2024)

  11. [11]

    DeepSeek-AI, Liu, A., Mei, A., Lin, B., Xue, B., Wang, B., Xu, B., Wu, B., Zhang, B., Lin, C., Dong, C., Lu, C., Zhao, C., Deng, C., Xu, C., Ruan, C., Dai, D., Guo, D., Yang, D., Chen, D., Li, E., Zhou, F., Lin, F., Dai, F., Hao, G., Chen, G., Li, G., Zhang, H., Xu, H., Li, H., Liang, H., Wei, H., Zhang, H., Luo, H., Ji, H., Ding, H., Tang, H., Cao, H., G...

  12. [12]

    org/abs/2501.01163

    Deng, J., He, T., Jiang, L., Wang, T., Dayoub, F., Reid, I.: 3d-llava: Towards generalist 3d lmms with omni superpoint transformer (2025),https://arxiv. org/abs/2501.01163

  13. [13]

    Fan, S., Cui, J., Guo, M.H., Yang, S.: Tool-augmented spatiotemporal reasoning for streamlining video question answering task (2025),https://arxiv.org/abs/ 2512.10359

  14. [14]

    In: European Conference on Computer Vision

    Fan, Y., Ma, X., Wu, R., Du, Y., Li, J., Gao, Z., Li, Q.: Videoagent: A memory- augmented multimodal agent for video understanding. In: European Conference on Computer Vision. pp. 75–92. Springer (2025)

  15. [15]

    Fan, Z., Zhang, J., Li, R., Zhang, J., Chen, R., Hu, H., Wang, K., Qu, H., Wang, D., Yan, Z., Xu, H., Theiss, J., Chen, T., Li, J., Tu, Z., Wang, Z., Ranjan, R.: Vlm- 3r: Vision-language models augmented with instruction-aligned 3d reconstruction (2025),https://arxiv.org/abs/2505.20279

  16. [16]

    Visual Programming: Compositional visual reasoning without training

    Gupta, T., Kembhavi, A.: Visual programming: Compositional visual reasoning without training. ArXivabs/2211.11559(2022)

  17. [17]

    arXiv preprint arXiv:2510.07181 (2025)

    Han, Y., Chi, C., Zhou, E., Rong, S., An, J., Wang, P., Wang, Z., Sheng, L., Zhang, S.: Tiger: Tool-integrated geometric reasoning in vision-language models for robotics. arXiv preprint arXiv:2510.07181 (2025)

  18. [18]

    In: CVPR

    Hong, W., Wang, W., Lv, Q., Xu, J., Yu, W., Ji, J., Wang, Y., Wang, Z., Dong, Y., Ding, M., et al.: Cogagent: A visual language model for gui agents. In: CVPR. pp. 14281–14290 (2024)

  19. [19]

    Hu, Y., Shi, W., Fu, X., Roth, D., Ostendorf, M., Zettlemoyer, L., Smith, N.A., Kr- ishna, R.: Visual sketchpad: Sketching as a visual chain of thought for multimodal language models (2024),https://arxiv.org/abs/2406.09403

  20. [20]

    NeurIPS (2024)

    Huang, H., Chen, Y., Wang, Z., Huang, R., Xu, R., Wang, T., Liu, L., Cheng, X., Zhao, Y., Pang, J., et al.: Chat-scene: Bridging 3d scene and large language models with object identifiers. NeurIPS (2024)

  21. [21]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2025) 18 H

    Huang, J., Jia, B., Wang, Y., Zhu, Z., Linghu, X., Li, Q., Zhu, S.C., Huang, S.: Un- veiling the mist over 3d vision-language understanding: Object-centric evaluation with chain-of-analysis. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2025) 18 H. Zheng et al

  22. [22]

    In: ICML

    Huang, J., Yong, S., Ma, X., Linghu, X., Li, P., Wang, Y., Li, Q., Zhu, S.C., Jia, B., Huang, S.: An embodied generalist agent in 3d world. In: ICML. pp. 20413–20451 (2024)

  23. [23]

    Huang, S.Y., Choe, J., Wang, Y.C.F., Sun, C.: Openvoxel: Training-free grouping and captioning voxels for open-vocabulary 3d scene understanding (2026),https: //arxiv.org/abs/2601.09575

  24. [24]

    GPT-4o System Card

    Hurst, A., Lerer, A., Goucher, A.P., Perelman, A., Ramesh, A., Clark, A., Os- trow, A., Welihinda, A., Hayes, A., Radford, A., et al.: Gpt-4o system card. arXiv preprint arXiv:2410.21276 (2024)

  25. [25]

    In: ECCV

    Jia, B., Chen, Y., Yu, H., Wang, Y., Niu, X., Liu, T., Li, Q., Huang, S.: Sceneverse: Scaling 3d vision-language learning for grounded scene understanding. In: ECCV. pp. 289–310. Springer (2024)

  26. [26]

    Li, G., Xu, J., Zhao, Y., Peng, Y.: Dyfo: A training-free dynamic focus visual search for enhancing lmms in fine-grained visual understanding (2025),https: //arxiv.org/abs/2504.14920

  27. [27]

    Advances in Neural Information Processing Systems (2024)

    Linghu, X., Huang, J., Niu, X., Ma, X., Jia, B., Huang, S.: Multi-modal situated reasoning in 3d scenes. Advances in Neural Information Processing Systems (2024)

  28. [28]

    In: Proceedings of the International Con- ference on Learning Representations (ICLR) (2026)

    Linghu, X., Huang, J., Zhu, Z., Jia, B., Huang, S.: Scenecot: Eliciting grounded chain-of-thought reasoning in 3d scenes. In: Proceedings of the International Con- ference on Learning Representations (ICLR) (2026)

  29. [29]

    NeurIPS36, 34892– 34916 (2023)

    Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. NeurIPS36, 34892– 34916 (2023)

  30. [30]

    Spatial-ssrl: Enhancing spatial understanding via self-supervised reinforcement learning.arXiv preprint arXiv:2510.27606, 2025

    Liu, Y., Zhang, B., Zang, Y., Cao, Y., Xing, L., Dong, X., Duan, H., Lin, D., Wang, J.: Spatial-ssrl: Enhancing spatial understanding via self-supervised reinforcement learning. arXiv preprint arXiv:2510.27606 (2025)

  31. [31]

    In: ICLR (2026),https://openreview.net/forum?id=yv15C8ql24

    Luo, Z., Zhang, C., Yong, S., Dai, C., Wang, Q., Ran, H., Shi, G., Sycara, K., Xie, Y.: pyspatial: Generating 3d visual programs for zero-shot spatial reasoning. In: ICLR (2026),https://openreview.net/forum?id=yv15C8ql24

  32. [32]

    OpenAI: Introducing chatgpt and whisper apis (Apr 2024),https://openai.com/ index/introducing-chatgpt-and-whisper-apis/

  33. [33]

    OpenAI, Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F.L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., Avila, R., Babuschkin, I., Balaji, S., Balcom, V., Baltescu, P., Bao, H., Bavarian, M., Belgum, J., Bello, I., Berdine, J., Bernadett-Shapiro, G., Berner, C., Bogdonoff, L., Boiko, O., Boyd, M., Brakman, A.L., Brockman, ...

  34. [34]

    Spacer: Rein- forcing mllms in video spatial reasoning.arXiv preprint arXiv:2504.01805, 2025

    Ouyang, K., Liu, Y., Wu, H., Liu, Y., Zhou, H., Zhou, J., Meng, F., Sun, X.: Spacer: Reinforcing mllms in video spatial reasoning. arXiv preprint arXiv:2504.01805 (2025)

  35. [35]

    Qi, Z., Zhang, Z., Fang, Y., Wang, J., Zhao, H.: Gpt4scene: Understand 3d scenes from videos with vision-language models (2025),https://arxiv.org/abs/2501. 01428

  36. [36]

    ai / blog ? from = research

    QwenTeam: Qwen3-coder: Agentic coding in the world (Jul 2025), https : / / qwen . ai / blog ? from = research . research - list & id = d927d7d2e59d059045ce758ded34f98c0186d2d7

  37. [37]

    Schick, T., Dwivedi-Yu, J., Dessì, R., Raileanu, R., Lomeli, M., Zettlemoyer, L., Cancedda, N., Scialom, T.: Toolformer: Language models can teach themselves to use tools (2023),https://arxiv.org/abs/2302.04761

  38. [38]

    org/10.48550/arXiv.2210.03105

    Schult, J., Engelmann, F., Hermans, A., Litany, O., Tang, S., Leibe, B.: Mask3d: Mask transformer for 3d semantic instance segmentation (2023),https://arxiv. org/abs/2210.03105

  39. [39]

    In: Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

    Si, C., Zhang, Y., Li, R., Yang, Z., Liu, R., Yang, D.: Design2code: Benchmarking multimodal code generation for automated front-end engineering. In: Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers). pp. 3956–3974 (2025)

  40. [40]

    Openthinkimg: Learning to think with images via visual tool reinforcement learning.arXiv preprint arXiv:2505.08617, 2025

    Su, Z., Li, L., Song, M., Hao, Y., Yang, Z., Zhang, J., Chen, G., Gu, J., Li, J., Qu, X., et al.: Openthinkimg: Learning to think with images via visual tool reinforce- ment learning. arXiv preprint arXiv:2505.08617 (2025)

  41. [41]

    arXiv preprint arXiv:2504.01901 (2025) 20 H

    Wang, H., Zhao, Y., Wang, T., Fan, H., Zhang, X., Zhang, Z.: Ross3d: Reconstruc- tive visual instruction tuning with 3d-awareness. arXiv preprint arXiv:2504.01901 (2025) 20 H. Zheng et al

  42. [42]

    In: Proceedings of the IEEE/CVF Confer- ence on Computer Vision and Pattern Recognition (2025)

    Wang, J., Chen, M., Karaev, N., Vedaldi, A., Rupprecht, C., Novotny, D.: Vggt: Visual geometry grounded transformer. In: Proceedings of the IEEE/CVF Confer- ence on Computer Vision and Pattern Recognition (2025)

  43. [43]

    Wang, Y., Chen, Q., Li, Z., Wang, S., Guo, S., Zhang, Z., Wei, Z.: Simple o3: Towards interleaved vision-language reasoning (2025),https://arxiv.org/abs/ 2508.12109

  44. [45]

    V*: Guided visual search as a core mechanism in multimodal llms.arXiv preprint arXiv:2312.14135, 2023

    Wu, P., Xie, S.: V*: Guided visual search as a core mechanism in multimodal llms. arXiv preprint arXiv:2312.14135 (2023)

  45. [46]

    Thinking in space: How multimodal large language models see, remember, and recall spaces.arXiv preprint arXiv:2412.14171, 2024

    Yang, J., Yang, S., Gupta, A.W., Han, R., Fei-Fei, L., Xie, S.: Thinking in Space: How Multimodal Large Language Models See, Remember and Recall Spaces. arXiv preprint arXiv:2412.14171 (2024)

  46. [47]

    Visual spatial tuning.arXiv preprint arXiv:2511.05491, 2025

    Yang, R., Zhu, Z., Li, Y., Huang, J., Yan, S., Zhou, S., Liu, Z., Li, X., Li, S., Wang, W., et al.: Visual spatial tuning. arXiv preprint arXiv:2511.05491 (2025)

  47. [48]

    In: International Conference on Learning Representations (ICLR) (2023)

    Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing reasoning and acting in language models. In: International Conference on Learning Representations (ICLR) (2023)

  48. [49]

    Yu, H., Li, W., Wang, S., Chen, J., Zhu, J.: Inst3d-lmm: Instance-aware 3d scene understanding with multi-modal instruction tuning (2025),https://arxiv.org/ abs/2503.00513

  49. [50]

    Yuan, H., Liu, Z., Zhou, J., Qian, H., Shu, Y., Sebe, N., Wen, J.R., Dou, Z.: Think with videos for agentic long-video understanding (2025),https://arxiv.org/abs/ 2506.10821

  50. [51]

    Thinking with videos: Multimodal tool-augmented reinforcement learning for long video reasoning.arXiv preprint arXiv:2508.04416, 2025

    Zhang, H., Gu, X., Li, J., Ma, C., Bai, S., Zhang, C., Zhang, B., Zhou, Z., He, D., Tang, Y.: Thinking with videos: Multimodal tool-augmented reinforcement learning for long video reasoning. arXiv preprint arXiv:2508.04416 (2025)

  51. [52]

    From flatland to space: Teaching vision-language models to perceive and reason in 3d.arXiv preprint arXiv:2503.22976, 2025

    Zhang, J., Chen, Y., Zhou, Y., Xu, Y., Huang, Z., Mei, J., Chen, J., Yuan, Y., Cai, X., Huang, G., Quan, X., Xu, H., Zhang, L.: From flatland to space: Teaching vision-language models to perceive and reason in 3d. arXiv preprint arXiv:2503.22976 (2025)

  52. [53]

    Deep video dis- covery: Agentic search with tool use for long-form video understanding.arXiv preprint arXiv:2505.18079, 2025

    Zhang, X., Jia, Z., Guo, Z., Li, J., Li, B., Li, H., Lu, Y.: Deep video discovery: Agentic search with tool use for long-form video understanding. arXiv preprint arXiv:2505.18079 (2025)

  53. [54]

    Zheng, D., Huang, S., Wang, L.: Video-3d llm: Learning position-aware video repre- sentation for 3d scene understanding (2024),https://arxiv.org/abs/2412.00493

  54. [55]

    DeepEyes: Incentivizing "Thinking with Images" via Reinforcement Learning

    Zheng, Z., Yang, M., Hong, J., Zhao, C., Xu, G., Yang, L., Shen, C., Yu, X.: Deepeyes: Incentivizing "thinking with images" via reinforcement learning (2025), https://arxiv.org/abs/2505.14362

  55. [56]

    In: Proceedings IEEE Interna- tional Conference on Computer Vision (ICCV) (2025)

    Zhu, C., Wang, T., Zhang, W., Pang, J., Liu, X.: Llava-3d: A simple yet effective pathway to empowering lmms with 3d-awareness. In: Proceedings IEEE Interna- tional Conference on Computer Vision (ICCV) (2025)

  56. [57]

    In: ICCV

    Zhu, Z., Ma, X., Chen, Y., Deng, Z., Huang, S., Li, Q.: 3d-vista: Pre-trained trans- former for 3d vision and text alignment. In: ICCV. pp. 2911–2921 (2023)

  57. [58]

    In: ECCV

    Zhu, Z., Zhang, Z., Ma, X., Niu, X., Chen, Y., Jia, B., Deng, Z., Huang, S., Li, Q.: Unifying 3d vision-language understanding via promptable queries. In: ECCV. pp. 188–206. Springer (2024) MAG-3D: Multi-Agent Grounded Reasoning for 3D Understanding A1 Appendix Contents of this appendix A Grounding Agent Implementation Details ......................... A1...