arxiv: 2604.09167 · v1 · submitted 2026-04-10 · 💻 cs.CV · cs.MA

Recognition: 2 theorem links

· Lean Theorem

MAG-3D: Multi-Agent Grounded Reasoning for 3D Understanding

Henry Zheng , Chenyue Fang , Rui Huang , Siyuan Wei , Xiao Liu , Gao Huang

Authors on Pith no claims yet

Pith reviewed 2026-05-10 17:20 UTC · model grok-4.3

classification 💻 cs.CV cs.MA

keywords multi-agent systems3D scene understandinggrounded reasoningvision-language modelstraining-free methodsgeometric verificationspatial reasoningzero-shot generalization

0 comments

The pith

A multi-agent framework lets off-the-shelf vision-language models perform grounded reasoning in 3D scenes without training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that dividing 3D reasoning work among specialized agents lets general-purpose vision-language models handle open-ended spatial questions in complex scenes. A planning agent breaks tasks into steps and directs the flow, a grounding agent locates relevant objects and pulls the right scene views from many observations, and a coding agent writes and runs programs to verify geometric relations. This replaces both task-specific fine-tuning and rigid hand-built pipelines with dynamic collaboration. If the claim holds, 3D understanding becomes more adaptable to new environments using only existing models.

Core claim

MAG-3D is a training-free multi-agent framework that coordinates a planning agent to decompose queries and orchestrate reasoning, a grounding agent to execute free-form 3D object identification and relevant frame retrieval from extensive scene observations, and a coding agent to conduct flexible geometric reasoning and explicit verification through executable programs, thereby enabling effective grounded reasoning across diverse 3D scenes with off-the-shelf VLMs.

What carries the argument

The multi-agent collaboration architecture with a planning agent, grounding agent, and coding agent that dynamically coordinates off-the-shelf VLMs for task decomposition, free-form scene grounding, and programmatic geometric verification.

If this is right

The system applies to novel 3D scenes in a zero-shot manner without retraining.
Open-ended queries about spatial and geometric relationships become feasible without fixed procedures.
Explicit code-based verification reduces errors in geometric checks compared to text-only reasoning.
Performance reaches state-of-the-art levels on challenging 3D reasoning benchmarks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar agent divisions could apply to related domains such as video or embodied scene reasoning.
Lower dependence on 3D-specific training data may ease deployment in resource-limited settings.
The modular design could support integration with robotic perception pipelines for real-time spatial tasks.

Load-bearing premise

Off-the-shelf vision-language models can reliably carry out accurate 3D grounding and geometric verification when guided by the planning, grounding, and coding agents without any task-specific training.

What would settle it

A 3D scene and natural-language query on which the multi-agent system returns incorrect object grounding or an erroneous spatial verification that a human observer can directly confirm as wrong.

Figures

Figures reproduced from arXiv: 2604.09167 by Chenyue Fang, Gao Huang, Henry Zheng, Rui Huang, Siyuan Wei, Xiao Liu.

**Figure 2.** Figure 2: The overall framework of MAG-3D. Given a question q and RGB observations I, the planning agent dynamically orchestrates expert agents for spatial grounding, geometric reasoning, and retrieves information from scene memory M as needed. The framework maintains explicit, inspectable intermediate results (e.g., grounded instances and geometric results), which are then aggregated and summarized into the fina… view at source ↗

**Figure 3.** Figure 3: Grounding-QA coherence on Beacon3D. Comparison of QA Score ≥ 75, Good Coherence, and R2 across state-of-the-art methods. We also compare against SceneCOT [28], which is trained on specially crafted data for 3D grounded reasoning, and observe that MAG-3D still achieves higher overall performance without in-domain training. In particular, relative to the previous best method, SceneCOT, in [PITH_FULL_IMAGE:f… view at source ↗

**Figure 4.** Figure 4: Qualitative results on Beacon3D. For each query, we show (top) the input question and RGB sequence, (middle) intermediate visual and geometric outputs, and (bottom) the predicted and ground-truth answers. 4.5 Qualitative Results [PITH_FULL_IMAGE:figures/full_fig_p014_4.png] view at source ↗

read the original abstract

Vision-language models (VLMs) have achieved strong performance in multimodal understanding and reasoning, yet grounded reasoning in 3D scenes remains underexplored. Effective 3D reasoning hinges on accurate grounding: to answer open-ended queries, a model must first identify query-relevant objects and regions in a complex scene, and then reason about their spatial and geometric relationships. Recent approaches have demonstrated strong potential for grounded 3D reasoning. However, they often rely on in-domain tuning or hand-crafted reasoning pipelines, which limit their flexibility and zero-shot generalization to novel environments. In this work, we present MAG-3D, a training-free multi-agent framework for grounded 3D reasoning with off-the-shelf VLMs. Instead of relying on task-specific training or fixed reasoning procedures, MAG-3D dynamically coordinates expert agents to address the key challenges of 3D reasoning. Specifically, we propose a planning agent that decomposes the task and orchestrates the overall reasoning process, a grounding agent that performs free-form 3D grounding and relevant frame retrieval from extensive 3D scene observations, and a coding agent that conducts flexible geometric reasoning and explicit verification through executable programs. This multi-agent collaborative design enables flexible training-free 3D grounded reasoning across diverse scenes and achieves state-of-the-art performance on challenging benchmarks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

MAG-3D splits 3D reasoning across planning, grounding, and coding agents with off-the-shelf VLMs to stay training-free, but the SOTA claim sits on experiments not visible in the abstract.

read the letter

The main thing here is a training-free setup that breaks grounded 3D reasoning into three agents: one plans the query, one grounds relevant objects and frames from scene data, and one writes executable code for geometric checks like distances or relations. This avoids fine-tuning and aims for flexibility across scenes, which is the concrete proposal worth noting. The coding agent step stands out because it tries to move beyond language-only inference to something verifiable through programs on point clouds or meshes. That design choice addresses a real limitation in prior VLM work on spatial tasks and could be useful for robotics or AR applications where precision matters. The overall framing also contrasts cleanly with methods that rely on in-domain tuning or rigid pipelines, so the paper earns credit for spelling out a dynamic coordination approach that builds directly on existing models. The soft spots are mostly around verification. The abstract asserts state-of-the-art results on challenging benchmarks, yet no numbers, ablations, or code examples appear in the provided description, which leaves the central performance claim uncheckable at this stage. The coding agent in particular carries a heavy load: if the VLM generates incorrect transformations, wrong metrics, or code that fails on edge cases in 3D data, the explicit verification step collapses and the training-free advantage disappears. That assumption is plausible but unproven here, and it matches the weakest point flagged in the stress test. The grounding and frame retrieval steps also lack enough implementation detail to judge robustness on large or cluttered scenes. This paper is for readers working on multimodal models for 3D vision who want ideas for agent-based pipelines rather than end-to-end trained systems. Someone already following VLM extensions to spatial reasoning would find the agent decomposition worth discussing, even if the results need closer inspection. It has enough of a structured idea and engagement with the literature to deserve peer review so that referees can examine the actual benchmarks, failure modes, and code generation reliability.

Referee Report

3 major / 3 minor

Summary. The paper introduces MAG-3D, a training-free multi-agent framework for grounded 3D reasoning that coordinates three off-the-shelf VLMs: a planning agent that decomposes queries and orchestrates the process, a grounding agent that performs free-form object and region identification plus frame retrieval from 3D scene observations, and a coding agent that translates geometric queries into executable programs for explicit spatial verification. The central claim is that this collaborative design enables flexible zero-shot 3D reasoning across diverse scenes and attains state-of-the-art results on challenging benchmarks without task-specific training or hand-crafted pipelines.

Significance. If the experimental results hold, the work would be significant for demonstrating that dynamic multi-agent coordination of existing VLMs can address core 3D grounding and verification challenges more flexibly than prior tuned or pipeline-based methods. The training-free nature and explicit code-based verification step represent a potentially generalizable direction for open-ended 3D understanding.

major comments (3)

[§4.2] §4.2 (Coding Agent): The central claim that the coding agent produces reliable executable programs for geometric verification is load-bearing, yet the manuscript provides no quantitative metrics on code-generation success rate, execution error frequency, or failure modes (e.g., incorrect coordinate transforms or distance computations). Without these, it is impossible to verify whether the “explicit verification” step actually contributes to the reported SOTA gains or merely passes through VLM hallucinations.
[Table 3, §5.1] Table 3 and §5.1: The SOTA performance claims are presented without ablation studies that isolate the contribution of the coding agent (e.g., replacing it with direct VLM reasoning) or the multi-agent orchestration. The performance delta over single-VLM baselines is therefore not attributable to the proposed design, weakening the causal link between the framework and the benchmark results.
[§3.3] §3.3 (Grounding Agent): The description of free-form 3D grounding and relevant-frame retrieval lacks detail on how the agent handles large-scale point clouds or meshes; no pseudocode, prompt templates, or failure-case analysis is supplied, making reproducibility and robustness assessment difficult despite the training-free claim.

minor comments (3)

[Abstract, §1] The abstract and §1 repeatedly use “state-of-the-art” without immediately citing the specific benchmarks and prior numbers; a parenthetical reference to the main results table would improve clarity.
[Figure 2] Figure 2 (agent interaction diagram) uses overlapping arrows that obscure the exact data flow between the grounding and coding agents; redrawing with clearer sequencing would aid comprehension.
[§3.1, §4.1] Notation for 3D coordinates and bounding boxes is introduced inconsistently between §3.1 and §4.1; a single unified definition table would eliminate ambiguity.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback, which highlights important areas for strengthening the empirical support and reproducibility of MAG-3D. We address each major comment below and will incorporate the suggested additions in the revised manuscript.

read point-by-point responses

Referee: [§4.2] §4.2 (Coding Agent): The central claim that the coding agent produces reliable executable programs for geometric verification is load-bearing, yet the manuscript provides no quantitative metrics on code-generation success rate, execution error frequency, or failure modes (e.g., incorrect coordinate transforms or distance computations). Without these, it is impossible to verify whether the “explicit verification” step actually contributes to the reported SOTA gains or merely passes through VLM hallucinations.

Authors: We agree that direct quantitative evaluation of the coding agent would provide stronger evidence for its contribution. Although the end-to-end SOTA results on multiple benchmarks offer indirect support for the verification step's utility, we will add a dedicated analysis in the revision. This will include code-generation success rates, execution error frequencies, and categorized failure modes (such as coordinate transform errors) computed over the evaluation datasets, presented in a new table or subsection under §4.2. revision: yes
Referee: [Table 3, §5.1] Table 3 and §5.1: The SOTA performance claims are presented without ablation studies that isolate the contribution of the coding agent (e.g., replacing it with direct VLM reasoning) or the multi-agent orchestration. The performance delta over single-VLM baselines is therefore not attributable to the proposed design, weakening the causal link between the framework and the benchmark results.

Authors: We acknowledge that the current comparisons to single-VLM baselines in Table 3 do not fully isolate the individual components. To address this, we will perform and report additional ablation experiments in the revised §5.1 and Table 3 (or an extended table). These will include variants that replace the coding agent with direct VLM-based geometric reasoning and variants that use a single unified agent instead of the multi-agent orchestration, allowing clearer attribution of performance gains to the proposed design. revision: yes
Referee: [§3.3] §3.3 (Grounding Agent): The description of free-form 3D grounding and relevant-frame retrieval lacks detail on how the agent handles large-scale point clouds or meshes; no pseudocode, prompt templates, or failure-case analysis is supplied, making reproducibility and robustness assessment difficult despite the training-free claim.

Authors: We agree that expanded implementation details are needed for reproducibility. In the revised manuscript, we will augment §3.3 with pseudocode outlining the grounding and frame-retrieval procedure, example prompt templates provided to the off-the-shelf VLM, and a short discussion of observed failure cases when processing large-scale point clouds or meshes. These additions will clarify the training-free mechanism without altering the core claims. revision: yes

Circularity Check

0 steps flagged

No circularity; empirical framework using external VLMs with no self-referential derivations

full rationale

The paper describes a training-free multi-agent system (planning, grounding, and coding agents) built on off-the-shelf VLMs. No equations, fitted parameters, or predictions appear in the provided text. Claims of SOTA performance rest on benchmark evaluation rather than any reduction of outputs to self-defined inputs or self-citation chains. The central design is presented as a novel coordination method whose validity is externally testable, satisfying the criteria for a self-contained, non-circular contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 3 invented entities

Only the abstract is available, so the ledger reflects high-level design assumptions rather than detailed derivations. The framework assumes existing VLMs possess sufficient grounding and coding capabilities when properly orchestrated.

axioms (1)

domain assumption Off-the-shelf VLMs can perform free-form 3D grounding and executable geometric reasoning when coordinated by specialized agents
This is the core premise enabling the training-free claim and is invoked throughout the abstract description of the agents.

invented entities (3)

Planning agent no independent evidence
purpose: Decompose tasks and orchestrate reasoning process
Newly proposed component in the multi-agent design; no independent evidence provided beyond the abstract claim.
Grounding agent no independent evidence
purpose: Perform 3D grounding and frame retrieval
Newly proposed component; no independent evidence beyond abstract.
Coding agent no independent evidence
purpose: Conduct geometric reasoning via executable programs
Newly proposed component; no independent evidence beyond abstract.

pith-pipeline@v0.9.0 · 5544 in / 1312 out tokens · 51541 ms · 2026-05-10T17:20:22.599313+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AlexanderDuality.lean alexander_duality_circle_linking unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

MAG-3D dynamically coordinates expert agents... planning agent that decomposes the task... grounding agent that performs free-form 3D grounding... coding agent that conducts flexible geometric reasoning and explicit verification through executable programs.
IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

training-free multi-agent framework... off-the-shelf VLMs... no task-specific training or hand-crafted pipelines

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

57 extracted references · 32 canonical work pages · 7 internal anchors

[1]

Anthropic: Claude 3.5 sonnet (Jun 2024),https://www.anthropic.com/news/ claude-3-5-sonnet

2024
[2]

arXiv (2025),https://ai.meta.com/ research/publications/locate- 3d- real- world- object- localization- via- self-supervised-learning-in-3d

Arnaud*, S., McVay*, P., Ada Martin*, A.M., Jatavallabhula, K.M., Thomas, P., Partsey, R., Dugas, D., Gejji, A., Sax, A., Berges, V.P., Henaff, M., Jain, A., Cao, A., Prasad, I., Kalakrishnan, M., Rabbat, M., Ballas, N., Assran, M., Maksymets, O., Rajeswaran, A., Meier, F.: Locate 3d: Real-world object local- ization via self-supervised learning in 3d. ar...

2025
[3]

ByteDance Seed Team: Seed1.6 (Jun 2025),https://seed.bytedance.com/en/ seed1_6

2025
[4]

Scaling spatial intelligence with multi- modal foundation models.arXiv preprint arXiv:2511.13719,

Cai, Z., Wang, R., Gu, C., Pu, F., Xu, J., Wang, Y., Yin, W., Yang, Z., Wei, C., Sun, Q., Zhou, T., Li, J., Pang, H.E., Qian, O., Wei, Y., Lin, Z., Shi, X., Deng, K., Han, X., Chen, Z., Fan, X., Deng, H., Lu, L., Pan, L., Li, B., Liu, Z., Wang, Q., Lin, D., Yang, L.: Scaling spatial intelligence with multimodal foundation models. arXiv preprint arXiv:2511...

work page arXiv 2025
[5]

SAM 3: Segment Anything with Concepts

Carion, N., Gustafson, L., Hu, Y.T., Debnath, S., Hu, R., Suris, D., Ryali, C., Alwala,K.V.,Khedr,H.,Huang,A.,etal.:Sam3:Segmentanythingwithconcepts. arXiv preprint arXiv:2511.16719 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[6]

In: CVPR

Chen, B., Xu, Z., Kirmani, S., Ichter, B., Sadigh, D., Guibas, L., Xia, F.: Spa- tialvlm: Endowing vision-language models with spatial reasoning capabilities. In: CVPR. pp. 14455–14465 (2024)

2024
[7]

In: CVPR

Chen, S., Chen, X., Zhang, C., Li, M., Yu, G., Fei, H., Zhu, H., Fan, J., Chen, T.: Ll3da: Visual interactive instruction tuning for omni-3d understanding reasoning and planning. In: CVPR. pp. 26428–26438 (2024)

2024
[8]

Geometrically-constrained agent for spatial reasoning.arXiv preprint arXiv:2511.22659, 2025

Chen, Z., Lu, X., Zheng, Z., Li, P., He, L., Zhou, Y., Shao, J., Zhuang, B., Sheng, L.: Geometrically-constrained agent for spatial reasoning. arXiv preprint arXiv:2511.22659 (2025)

work page arXiv 2025
[9]

Think with 3d: Geometric imagina- tion grounded spatial reasoning from limited views.arXiv preprint arXiv:2510.18632, 2025

Chen, Z., Zhang, M., Yu, X., Luo, X., Sun, M., Pan, Z., Feng, Y., Pei, P., Cai, X., Huang, R.: Think with 3d: Geometric imagination grounded spatial reasoning from limited views. arXiv preprint arXiv:2510.18632 (2025)

work page arXiv 2025
[10]

NeurIPS37, 135062–135093 (2024)

Cheng, A.C., Yin, H., Fu, Y., Guo, Q., Yang, R., Kautz, J., Wang, X., Liu, S.: Spatialrgpt: Grounded spatial reasoning in vision-language models. NeurIPS37, 135062–135093 (2024)

2024
[11]

DeepSeek-AI, Liu, A., Mei, A., Lin, B., Xue, B., Wang, B., Xu, B., Wu, B., Zhang, B., Lin, C., Dong, C., Lu, C., Zhao, C., Deng, C., Xu, C., Ruan, C., Dai, D., Guo, D., Yang, D., Chen, D., Li, E., Zhou, F., Lin, F., Dai, F., Hao, G., Chen, G., Li, G., Zhang, H., Xu, H., Li, H., Liang, H., Wei, H., Zhang, H., Luo, H., Ji, H., Ding, H., Tang, H., Cao, H., G...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[12]

org/abs/2501.01163

Deng, J., He, T., Jiang, L., Wang, T., Dayoub, F., Reid, I.: 3d-llava: Towards generalist 3d lmms with omni superpoint transformer (2025),https://arxiv. org/abs/2501.01163

work page arXiv 2025
[13]

Fan, S., Cui, J., Guo, M.H., Yang, S.: Tool-augmented spatiotemporal reasoning for streamlining video question answering task (2025),https://arxiv.org/abs/ 2512.10359

work page arXiv 2025
[14]

In: European Conference on Computer Vision

Fan, Y., Ma, X., Wu, R., Du, Y., Li, J., Gao, Z., Li, Q.: Videoagent: A memory- augmented multimodal agent for video understanding. In: European Conference on Computer Vision. pp. 75–92. Springer (2025)

2025
[15]

Fan, Z., Zhang, J., Li, R., Zhang, J., Chen, R., Hu, H., Wang, K., Qu, H., Wang, D., Yan, Z., Xu, H., Theiss, J., Chen, T., Li, J., Tu, Z., Wang, Z., Ranjan, R.: Vlm- 3r: Vision-language models augmented with instruction-aligned 3d reconstruction (2025),https://arxiv.org/abs/2505.20279

work page internal anchor Pith review Pith/arXiv arXiv 2025
[16]

Visual Programming: Compositional visual reasoning without training

Gupta, T., Kembhavi, A.: Visual programming: Compositional visual reasoning without training. ArXivabs/2211.11559(2022)

work page arXiv 2022
[17]

arXiv preprint arXiv:2510.07181 (2025)

Han, Y., Chi, C., Zhou, E., Rong, S., An, J., Wang, P., Wang, Z., Sheng, L., Zhang, S.: Tiger: Tool-integrated geometric reasoning in vision-language models for robotics. arXiv preprint arXiv:2510.07181 (2025)

work page arXiv 2025
[18]

In: CVPR

Hong, W., Wang, W., Lv, Q., Xu, J., Yu, W., Ji, J., Wang, Y., Wang, Z., Dong, Y., Ding, M., et al.: Cogagent: A visual language model for gui agents. In: CVPR. pp. 14281–14290 (2024)

2024
[19]

Hu, Y., Shi, W., Fu, X., Roth, D., Ostendorf, M., Zettlemoyer, L., Smith, N.A., Kr- ishna, R.: Visual sketchpad: Sketching as a visual chain of thought for multimodal language models (2024),https://arxiv.org/abs/2406.09403

work page arXiv 2024
[20]

NeurIPS (2024)

Huang, H., Chen, Y., Wang, Z., Huang, R., Xu, R., Wang, T., Liu, L., Cheng, X., Zhao, Y., Pang, J., et al.: Chat-scene: Bridging 3d scene and large language models with object identifiers. NeurIPS (2024)

2024
[21]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2025) 18 H

Huang, J., Jia, B., Wang, Y., Zhu, Z., Linghu, X., Li, Q., Zhu, S.C., Huang, S.: Un- veiling the mist over 3d vision-language understanding: Object-centric evaluation with chain-of-analysis. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2025) 18 H. Zheng et al

2025
[22]

In: ICML

Huang, J., Yong, S., Ma, X., Linghu, X., Li, P., Wang, Y., Li, Q., Zhu, S.C., Jia, B., Huang, S.: An embodied generalist agent in 3d world. In: ICML. pp. 20413–20451 (2024)

2024
[23]

Huang, S.Y., Choe, J., Wang, Y.C.F., Sun, C.: Openvoxel: Training-free grouping and captioning voxels for open-vocabulary 3d scene understanding (2026),https: //arxiv.org/abs/2601.09575

work page arXiv 2026
[24]

GPT-4o System Card

Hurst, A., Lerer, A., Goucher, A.P., Perelman, A., Ramesh, A., Clark, A., Os- trow, A., Welihinda, A., Hayes, A., Radford, A., et al.: Gpt-4o system card. arXiv preprint arXiv:2410.21276 (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024
[25]

In: ECCV

Jia, B., Chen, Y., Yu, H., Wang, Y., Niu, X., Liu, T., Li, Q., Huang, S.: Sceneverse: Scaling 3d vision-language learning for grounded scene understanding. In: ECCV. pp. 289–310. Springer (2024)

2024
[26]

Li, G., Xu, J., Zhao, Y., Peng, Y.: Dyfo: A training-free dynamic focus visual search for enhancing lmms in fine-grained visual understanding (2025),https: //arxiv.org/abs/2504.14920

work page arXiv 2025
[27]

Advances in Neural Information Processing Systems (2024)

Linghu, X., Huang, J., Niu, X., Ma, X., Jia, B., Huang, S.: Multi-modal situated reasoning in 3d scenes. Advances in Neural Information Processing Systems (2024)

2024
[28]

In: Proceedings of the International Con- ference on Learning Representations (ICLR) (2026)

Linghu, X., Huang, J., Zhu, Z., Jia, B., Huang, S.: Scenecot: Eliciting grounded chain-of-thought reasoning in 3d scenes. In: Proceedings of the International Con- ference on Learning Representations (ICLR) (2026)

2026
[29]

NeurIPS36, 34892– 34916 (2023)

Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. NeurIPS36, 34892– 34916 (2023)

2023
[30]

Spatial-ssrl: Enhancing spatial understanding via self-supervised reinforcement learning.arXiv preprint arXiv:2510.27606, 2025

Liu, Y., Zhang, B., Zang, Y., Cao, Y., Xing, L., Dong, X., Duan, H., Lin, D., Wang, J.: Spatial-ssrl: Enhancing spatial understanding via self-supervised reinforcement learning. arXiv preprint arXiv:2510.27606 (2025)

work page arXiv 2025
[31]

In: ICLR (2026),https://openreview.net/forum?id=yv15C8ql24

Luo, Z., Zhang, C., Yong, S., Dai, C., Wang, Q., Ran, H., Shi, G., Sycara, K., Xie, Y.: pyspatial: Generating 3d visual programs for zero-shot spatial reasoning. In: ICLR (2026),https://openreview.net/forum?id=yv15C8ql24

2026
[32]

OpenAI: Introducing chatgpt and whisper apis (Apr 2024),https://openai.com/ index/introducing-chatgpt-and-whisper-apis/

2024
[33]

OpenAI, Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F.L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., Avila, R., Babuschkin, I., Balaji, S., Balcom, V., Baltescu, P., Bao, H., Bavarian, M., Belgum, J., Bello, I., Berdine, J., Bernadett-Shapiro, G., Berner, C., Bogdonoff, L., Boiko, O., Boyd, M., Brakman, A.L., Brockman, ...

work page internal anchor Pith review Pith/arXiv arXiv 2024
[34]

Spacer: Rein- forcing mllms in video spatial reasoning.arXiv preprint arXiv:2504.01805, 2025

Ouyang, K., Liu, Y., Wu, H., Liu, Y., Zhou, H., Zhou, J., Meng, F., Sun, X.: Spacer: Reinforcing mllms in video spatial reasoning. arXiv preprint arXiv:2504.01805 (2025)

work page arXiv 2025
[35]

Qi, Z., Zhang, Z., Fang, Y., Wang, J., Zhao, H.: Gpt4scene: Understand 3d scenes from videos with vision-language models (2025),https://arxiv.org/abs/2501. 01428

2025
[36]

ai / blog ? from = research

QwenTeam: Qwen3-coder: Agentic coding in the world (Jul 2025), https : / / qwen . ai / blog ? from = research . research - list & id = d927d7d2e59d059045ce758ded34f98c0186d2d7

2025
[37]

Schick, T., Dwivedi-Yu, J., Dessì, R., Raileanu, R., Lomeli, M., Zettlemoyer, L., Cancedda, N., Scialom, T.: Toolformer: Language models can teach themselves to use tools (2023),https://arxiv.org/abs/2302.04761

work page internal anchor Pith review arXiv 2023
[38]

org/10.48550/arXiv.2210.03105

Schult, J., Engelmann, F., Hermans, A., Litany, O., Tang, S., Leibe, B.: Mask3d: Mask transformer for 3d semantic instance segmentation (2023),https://arxiv. org/abs/2210.03105

work page arXiv 2023
[39]

In: Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

Si, C., Zhang, Y., Li, R., Yang, Z., Liu, R., Yang, D.: Design2code: Benchmarking multimodal code generation for automated front-end engineering. In: Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers). pp. 3956–3974 (2025)

2025
[40]

Openthinkimg: Learning to think with images via visual tool reinforcement learning.arXiv preprint arXiv:2505.08617, 2025

Su, Z., Li, L., Song, M., Hao, Y., Yang, Z., Zhang, J., Chen, G., Gu, J., Li, J., Qu, X., et al.: Openthinkimg: Learning to think with images via visual tool reinforce- ment learning. arXiv preprint arXiv:2505.08617 (2025)

work page arXiv 2025
[41]

arXiv preprint arXiv:2504.01901 (2025) 20 H

Wang, H., Zhao, Y., Wang, T., Fan, H., Zhang, X., Zhang, Z.: Ross3d: Reconstruc- tive visual instruction tuning with 3d-awareness. arXiv preprint arXiv:2504.01901 (2025) 20 H. Zheng et al

work page arXiv 2025
[42]

In: Proceedings of the IEEE/CVF Confer- ence on Computer Vision and Pattern Recognition (2025)

Wang, J., Chen, M., Karaev, N., Vedaldi, A., Rupprecht, C., Novotny, D.: Vggt: Visual geometry grounded transformer. In: Proceedings of the IEEE/CVF Confer- ence on Computer Vision and Pattern Recognition (2025)

2025
[43]

Wang, Y., Chen, Q., Li, Z., Wang, S., Guo, S., Zhang, Z., Wei, Z.: Simple o3: Towards interleaved vision-language reasoning (2025),https://arxiv.org/abs/ 2508.12109

work page arXiv 2025
[45]

V*: Guided visual search as a core mechanism in multimodal llms.arXiv preprint arXiv:2312.14135, 2023

Wu, P., Xie, S.: V*: Guided visual search as a core mechanism in multimodal llms. arXiv preprint arXiv:2312.14135 (2023)

work page arXiv 2023
[46]

Thinking in space: How multimodal large language models see, remember, and recall spaces.arXiv preprint arXiv:2412.14171, 2024

Yang, J., Yang, S., Gupta, A.W., Han, R., Fei-Fei, L., Xie, S.: Thinking in Space: How Multimodal Large Language Models See, Remember and Recall Spaces. arXiv preprint arXiv:2412.14171 (2024)

work page arXiv 2024
[47]

Visual spatial tuning.arXiv preprint arXiv:2511.05491, 2025

Yang, R., Zhu, Z., Li, Y., Huang, J., Yan, S., Zhou, S., Liu, Z., Li, X., Li, S., Wang, W., et al.: Visual spatial tuning. arXiv preprint arXiv:2511.05491 (2025)

work page arXiv 2025
[48]

In: International Conference on Learning Representations (ICLR) (2023)

Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., Cao, Y.: ReAct: Synergizing reasoning and acting in language models. In: International Conference on Learning Representations (ICLR) (2023)

2023
[49]

Yu, H., Li, W., Wang, S., Chen, J., Zhu, J.: Inst3d-lmm: Instance-aware 3d scene understanding with multi-modal instruction tuning (2025),https://arxiv.org/ abs/2503.00513

work page arXiv 2025
[50]

Yuan, H., Liu, Z., Zhou, J., Qian, H., Shu, Y., Sebe, N., Wen, J.R., Dou, Z.: Think with videos for agentic long-video understanding (2025),https://arxiv.org/abs/ 2506.10821

work page arXiv 2025
[51]

Thinking with videos: Multimodal tool-augmented reinforcement learning for long video reasoning.arXiv preprint arXiv:2508.04416, 2025

Zhang, H., Gu, X., Li, J., Ma, C., Bai, S., Zhang, C., Zhang, B., Zhou, Z., He, D., Tang, Y.: Thinking with videos: Multimodal tool-augmented reinforcement learning for long video reasoning. arXiv preprint arXiv:2508.04416 (2025)

work page arXiv 2025
[52]

From flatland to space: Teaching vision-language models to perceive and reason in 3d.arXiv preprint arXiv:2503.22976, 2025

Zhang, J., Chen, Y., Zhou, Y., Xu, Y., Huang, Z., Mei, J., Chen, J., Yuan, Y., Cai, X., Huang, G., Quan, X., Xu, H., Zhang, L.: From flatland to space: Teaching vision-language models to perceive and reason in 3d. arXiv preprint arXiv:2503.22976 (2025)

work page arXiv 2025
[53]

Deep video dis- covery: Agentic search with tool use for long-form video understanding.arXiv preprint arXiv:2505.18079, 2025

Zhang, X., Jia, Z., Guo, Z., Li, J., Li, B., Li, H., Lu, Y.: Deep video discovery: Agentic search with tool use for long-form video understanding. arXiv preprint arXiv:2505.18079 (2025)

work page arXiv 2025
[54]

Zheng, D., Huang, S., Wang, L.: Video-3d llm: Learning position-aware video repre- sentation for 3d scene understanding (2024),https://arxiv.org/abs/2412.00493

work page arXiv 2024
[55]

DeepEyes: Incentivizing "Thinking with Images" via Reinforcement Learning

Zheng, Z., Yang, M., Hong, J., Zhao, C., Xu, G., Yang, L., Shen, C., Yu, X.: Deepeyes: Incentivizing "thinking with images" via reinforcement learning (2025), https://arxiv.org/abs/2505.14362

work page internal anchor Pith review arXiv 2025
[56]

In: Proceedings IEEE Interna- tional Conference on Computer Vision (ICCV) (2025)

Zhu, C., Wang, T., Zhang, W., Pang, J., Liu, X.: Llava-3d: A simple yet effective pathway to empowering lmms with 3d-awareness. In: Proceedings IEEE Interna- tional Conference on Computer Vision (ICCV) (2025)

2025
[57]

In: ICCV

Zhu, Z., Ma, X., Chen, Y., Deng, Z., Huang, S., Li, Q.: 3d-vista: Pre-trained trans- former for 3d vision and text alignment. In: ICCV. pp. 2911–2921 (2023)

2023
[58]

In: ECCV

Zhu, Z., Zhang, Z., Ma, X., Niu, X., Chen, Y., Jia, B., Deng, Z., Huang, S., Li, Q.: Unifying 3d vision-language understanding via promptable queries. In: ECCV. pp. 188–206. Springer (2024) MAG-3D: Multi-Agent Grounded Reasoning for 3D Understanding A1 Appendix Contents of this appendix A Grounding Agent Implementation Details ......................... A1...

2024