SpatiO: Adaptive Test-Time Orchestration of Vision-Language Agents for Spatial Reasoning
Pith reviewed 2026-05-09 22:49 UTC · model grok-4.3
The pith
SpatiO coordinates heterogeneous vision-language models with test-time reliability scoring to improve spatial reasoning without retraining.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
SpatiO is a heterogeneous multi-agent framework that coordinates multiple vision-language specialists with complementary inductive biases. Its Test-Time Orchestration (TTO) procedure evaluates and reweights each agent's contribution according to the agent's observed reliability on the given input, without any parameter updates. The authors report consistent gains on 3DSRBench, STVQA-7k, CV-Bench, and Omni3D-Bench over both closed-source and open-source baselines.
What carries the argument
Test-Time Orchestration (TTO), the inference-time procedure that scores each agent's reliability from its behavior on the input and uses those scores to blend outputs from heterogeneous vision-language models.
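The paper (as summarized here) does not spell out TTO's blending rule. As a hedged illustration only, one plausible reading is a reliability-weighted vote over agent answers; the function names `orchestrate` and the zero-score fallback below are our own assumptions, not the authors' method:

```python
from collections import defaultdict

def orchestrate(answers, scores):
    """Blend agents' answers by test-time reliability (illustrative sketch).

    answers: list of answer strings, one per agent
    scores:  list of non-negative reliability scores, one per agent
    Returns the answer with the highest total reliability-weighted mass.
    """
    mass = defaultdict(float)
    for ans, s in zip(answers, scores):
        mass[ans] += s
    # Fall back to a plain majority vote when every score is zero.
    if all(s == 0 for s in scores):
        for ans in answers:
            mass[ans] += 1.0
    return max(mass, key=mass.get)
```

For example, `orchestrate(["left", "right", "left"], [0.2, 0.9, 0.1])` returns `"right"`: a single high-reliability agent outweighs two low-reliability agents that happen to agree, which is the behavior a static majority vote cannot reproduce.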
If this is right
- Heterogeneous agents outperform homogeneous ones because they supply a wider range of spatial inductive biases.
- Dynamic reweighting at inference supplies adaptability that fixed single-pipeline models lack.
- The same orchestration works on both closed-source and open-source vision-language models.
- Gains appear across benchmarks that mix 2D appearance cues with 3D geometric constraints.
Where Pith is reading between the lines
- The same reliability-scoring idea could be applied to other vision tasks where the usefulness of different cues varies by input, such as counting or occlusion reasoning.
- Smaller specialized models might contribute more when their outputs are selectively amplified rather than averaged with larger general models.
- Replacing hand-designed reliability signals with a small learned predictor could reduce the number of forward passes required during orchestration.
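The last bullet can be sketched concretely. Below is a minimal hypothetical learned reliability predictor: a logistic model over cheap per-agent signals (e.g., answer entropy, self-consistency rate). Nothing here comes from the paper; the feature choice and the idea that `weights`/`bias` would be fit offline against per-instance correctness labels are our assumptions:

```python
import math

def predicted_reliability(features, weights, bias):
    """Hypothetical learned reliability predictor (logistic model).

    features: per-agent signals observable at test time, e.g.
              [self-consistency rate, negative answer entropy].
    weights, bias: parameters that would be fit offline on held-out
              data against per-instance correctness labels.
    Returns an estimated probability that the agent is correct.
    """
    z = bias + sum(w * f for w, f in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))
```

A single forward pass of such a predictor would replace the repeated sampling that hand-designed consistency signals typically require, which is the cost reduction the bullet points at.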
Load-bearing premise
Multiple vision-language models possess sufficiently different spatial reasoning behaviors that a signal observable at test time can identify which ones are trustworthy for the current input.
What would settle it
On a benchmark where every model produces identical spatial errors, reweighting by any test-time reliability signal would yield no accuracy gain over the single best agent.
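The degenerate case above is mechanical and easy to verify: if every agent returns the same (wrong) answer, no choice of weights can recover the truth. A minimal sketch, with an illustrative `weighted_answer` helper of our own:

```python
def weighted_answer(answers, weights):
    """Pick the answer with the largest total weight."""
    totals = {}
    for ans, w in zip(answers, weights):
        totals[ans] = totals.get(ans, 0.0) + w
    return max(totals, key=totals.get)

# When every agent errs identically, any weighting reproduces the
# shared error: reweighting cannot create information the pool lacks.
identical = ["behind"] * 4
assert all(
    weighted_answer(identical, w) == "behind"
    for w in ([1, 1, 1, 1], [0.9, 0.05, 0.03, 0.02], [0, 0, 0, 1])
)
```

This is why the framework's load-bearing premise is agent *diversity*: the reliability signal only matters once the pool disagrees.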
Original abstract
Understanding visual scenes requires not only recognizing objects but also reasoning about their spatial relationships. Unlike general vision-language tasks, spatial reasoning requires integrating multiple inductive biases, such as 2D appearance cues, depth signals, and geometric constraints, whose reliability varies across contexts. This suggests that effective spatial reasoning requires spatial adaptability: the ability to flexibly coordinate different reasoning strategies depending on the input. However, most existing approaches rely on a single reasoning pipeline that implicitly learns a fixed spatial prior, limiting their ability to adapt under distribution changes. Multi-agent systems offer a promising alternative by aggregating diverse reasoning trajectories, but prior attempts in spatial reasoning primarily employ homogeneous agents, restricting the diversity of inductive biases they can leverage. In this work, we introduce SpatiO, a heterogeneous multi-agent framework for spatial reasoning that coordinates multiple vision-language specialists with complementary inductive biases. To enable effective collaboration, we propose Test-Time Orchestration (TTO), an optimization mechanism that dynamically evaluates and reweights agents based on their observed reliability during inference, without modifying model parameters. Extensive experiments on diverse spatial reasoning benchmarks, including 3DSRBench, STVQA-7k, CV-Bench, and Omni3D-Bench, demonstrate that SpatiO consistently improves spatial reasoning performance over both closed-source and open-source baselines.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces SpatiO, a heterogeneous multi-agent framework for spatial reasoning that coordinates multiple vision-language specialists with complementary inductive biases. It proposes Test-Time Orchestration (TTO), an optimization mechanism that dynamically evaluates and reweights agents based on their observed reliability during inference without modifying model parameters. The central claim is that SpatiO yields consistent performance gains over closed-source and open-source baselines on the spatial reasoning benchmarks 3DSRBench, STVQA-7k, CV-Bench, and Omni3D-Bench.
Significance. If the empirical gains are shown to arise from adaptive reweighting rather than static ensembling, and if the reliability proxy is validated, the work would offer a practical test-time approach to leveraging model diversity for tasks whose inductive biases vary by input. The emphasis on heterogeneous agents and parameter-free adaptation at inference time is a clear strength relative to single-pipeline or homogeneous multi-agent baselines.
major comments (3)
- [Abstract and §3] Abstract and §3 (TTO description): the reliability signal is described only as 'observed reliability during inference' with no explicit definition, equation, or algorithm (e.g., consistency across views, entropy, or geometric consistency check). Without this, it is impossible to determine whether the proxy is calibrated to per-instance spatial accuracy or whether reported gains could be explained by model diversity alone.
- [§4] §4 (Experiments): no ablation is reported that isolates TTO from static averaging or random selection of the same agent pool, nor are per-benchmark numbers, error bars, or statistical significance tests supplied. This leaves the central claim that adaptive orchestration (rather than extra inference budget) drives the improvements unverified.
- [§3.1] §3.1 (Agent selection): the specific complementary inductive biases assigned to each specialist (e.g., depth vs. 2D appearance) are not enumerated or justified, so it is unclear whether the heterogeneity is load-bearing or whether any diverse set would suffice.
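The second major comment asks for controls that isolate adaptivity from inference budget. The two baselines it names can be made precise over the same agent pool; a minimal sketch, with function names of our own (an adaptive TTO run would need to beat both to support the central claim):

```python
import random

def static_average(answers):
    """Uniform-weight control: plain majority vote over the pool."""
    return max(set(answers), key=answers.count)

def random_selection(answers, rng):
    """Random-selection control: pick one agent's answer at random,
    matching the pool size but using no reliability signal."""
    return rng.choice(answers)
```

Both controls consume the same agent outputs as TTO, so any accuracy gap between them and the full system is attributable to the reweighting mechanism rather than to extra forward passes.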
minor comments (2)
- [Abstract] The abstract would benefit from a single sentence summarizing the magnitude of the reported gains and the number of agents used.
- [§3] Notation for the reweighting function in TTO should be introduced with a clear equation rather than prose only.
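To make the second minor comment concrete: one conventional way such a reweighting is written is a softmax over per-input reliability scores. The form below is purely a hedged placeholder with our own symbols, not the paper's notation:

```latex
% Hypothetical reweighting form (illustrative, not from the paper):
y^* = \arg\max_{y} \sum_{i=1}^{N} w_i(x)\,\mathbf{1}\!\left[f_i(x) = y\right],
\qquad
w_i(x) = \frac{\exp\!\big(r_i(x)/\tau\big)}{\sum_{j=1}^{N} \exp\!\big(r_j(x)/\tau\big)}
```

where $f_i(x)$ is agent $i$'s answer on input $x$, $r_i(x)$ its observed reliability signal, and $\tau$ a temperature controlling how sharply orchestration concentrates on the most reliable agent.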
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed comments, which help clarify key aspects of our work. We address each major comment point by point below and will revise the manuscript accordingly.
Point-by-point responses
Referee: [Abstract and §3] Abstract and §3 (TTO description): the reliability signal is described only as 'observed reliability during inference' with no explicit definition, equation, or algorithm (e.g., consistency across views, entropy, or geometric consistency check). Without this, it is impossible to determine whether the proxy is calibrated to per-instance spatial accuracy or whether reported gains could be explained by model diversity alone.
Authors: We agree that the current description of the reliability signal is high-level. The manuscript does not provide an explicit equation or algorithm in the abstract or §3. We will revise §3 to include a precise definition of the reliability proxy, along with the corresponding equation and pseudocode for the TTO procedure, to make clear how it is computed and its relation to per-instance accuracy. revision: yes
Referee: [§4] §4 (Experiments): no ablation is reported that isolates TTO from static averaging or random selection of the same agent pool, nor are per-benchmark numbers, error bars, or statistical significance tests supplied. This leaves the central claim that adaptive orchestration (rather than extra inference budget) drives the improvements unverified.
Authors: We acknowledge that the experiments section does not contain ablations isolating the adaptive reweighting mechanism of TTO from static averaging or random selection, nor does it report error bars or statistical significance tests. We will add these ablations, per-benchmark breakdowns, error bars, and significance tests in the revised §4 to verify that the observed gains arise specifically from the adaptive orchestration. revision: yes
Referee: [§3.1] §3.1 (Agent selection): the specific complementary inductive biases assigned to each specialist (e.g., depth vs. 2D appearance) are not enumerated or justified, so it is unclear whether the heterogeneity is load-bearing or whether any diverse set would suffice.
Authors: We thank the referee for this observation. Section 3.1 introduces the heterogeneous agents but does not enumerate or justify the specific inductive biases assigned to each specialist. We will revise §3.1 to explicitly list and justify these biases (e.g., depth estimation, 2D appearance, and geometric constraints) and explain why this particular heterogeneity is important for spatial reasoning. revision: yes
Circularity Check
No circularity: method is an independent orchestration layer with no reducing derivations
full rationale
The paper presents SpatiO as a heterogeneous multi-agent framework using Test-Time Orchestration (TTO) to dynamically reweight vision-language agents based on observed reliability at inference time, without parameter updates. No equations, derivations, or mathematical claims appear in the abstract or method description that could reduce a 'prediction' or result to fitted inputs, self-definitions, or self-citations by construction. The central claims rest on empirical improvements across benchmarks rather than any closed-form chain or uniqueness theorem imported from prior author work. No ansatzes are smuggled via citation, and no known results are renamed as novel organization. The approach is self-contained as an external coordination mechanism whose validity is evaluated externally via benchmark gains.
Axiom & Free-Parameter Ledger
axioms (2)
- Domain assumption: Vision-language models possess complementary inductive biases (2D appearance, depth, geometric constraints) whose reliability varies by input.
- Domain assumption: Test-time reliability signals can be observed and used to reweight agents without modifying model parameters.
Reference graph
Works this paper leans on
[1] Algazinov, A., Laing, M., Laban, P.: Mate: LLM-powered multi-agent translation environment for accessibility applications. arXiv:2506.19502 (2025)
[2] Anthropic: Claude Opus 4.6. https://www.anthropic.com/claude (2025), accessed 2026-03-05
[3] Bai, S., Cai, Y., Chen, R., Chen, K., Chen, X., Cheng, Z., Deng, L., Ding, W., Gao, C., Ge, C., et al.: Qwen3-VL technical report. arXiv:2511.21631 (2025)
[4] Batra, H., Tu, H., Chen, H., Lin, Y., Xie, C., Clark, R.: SpatialThinker: Reinforcing 3D reasoning in multimodal LLMs via spatial rewards. arXiv:2511.07403 (2025)
[5] Bochkovskii, A., Delaunoy, A., Germain, H., Santos, M., Zhou, Y., Richter, S.R., Koltun, V.: Depth Pro: Sharp monocular metric depth in less than a second (2024)
[6] Brazil, G., Kumar, A., Straub, J., Ravi, N., Johnson, J., Gkioxari, G.: Omni3D: A large benchmark and model for 3D object detection in the wild. In: CVPR (2023)
[7] Chen, B., Xu, Z., Kirmani, S., Ichter, B., Sadigh, D., Guibas, L., Xia, F.: SpatialVLM: Endowing vision-language models with spatial reasoning capabilities. In: CVPR (2024)
[8] Chen, Z., Lu, X., Zheng, Z., Li, P., He, L., Zhou, Y., Shao, J., Zhuang, B., Sheng, L.: Geometrically-constrained agent for spatial reasoning. arXiv (2025)
[9] Chen, Z., Wang, W., Cao, Y., Liu, Y., Gao, Z., Cui, E., Zhu, J., Ye, S., Tian, H., Liu, Z., et al.: Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling. arXiv (2024)
[10] Cheng, A.C., Yin, H., Fu, Y., Guo, Q., Yang, R., Kautz, J., Wang, X., Liu, S.: SpatialRGPT: Grounded spatial reasoning in vision-language models (2024)
[11] Choi, H.K., et al.: Debate or vote: Which yields better decisions in multi-agent large language models? In: NeurIPS (2025), spotlight
[12] Chu, X., Qiao, L., Zhang, X., Xu, S., Wei, F., Yang, Y., Sun, X., Hu, Y., Lin, X., Zhang, B., et al.: MobileVLM V2: Faster and stronger baseline for vision language model. arXiv (2024)
[13] Fung, H.L., Darvariu, V.A., Hailes, S., Musolesi, M.: Trust-based consensus in multi-agent reinforcement learning systems. arXiv:2205.12880 (2022)
[14] Guo, D., Yang, D., Zhang, H., Song, J., Wang, P., Zhu, Q., Xu, R., Zhang, R., Ma, S., Bi, X., et al.: DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv (2025)
[15] Gupta, T., Kembhavi, A.: Visual programming: Compositional visual reasoning without training. In: CVPR (2023)
[16] Gupta, T., et al.: ConceptGraphs: Open-vocabulary 3D scene graphs for situated reasoning. In: CVPR (2023)
[17] Hallyburton, R.S., Pajic, M.: Bayesian methods for trust in collaborative multi-agent autonomy (2024)
[18] Krishna, R., Zhu, Y., Groth, O., Johnson, J., Hata, K., Kravitz, J., Chen, S., Kalantidis, Y., Li, L.J., Shamma, D.A., et al.: Visual Genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision (2017)
[19] Li, H., Li, D., Wang, Z., Yan, Y., Wu, H., Zhang, W., Shen, Y., Lu, W., Xiao, J., Zhuang, Y.: SpatialLadder: Progressive training for spatial reasoning in vision-language models. arXiv (2025)
[20] Liang, J., He, R., Tan, T.: A comprehensive survey on test-time adaptation under distribution shifts. International Journal of Computer Vision 133(1), 31–64 (2025)
[21] Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft COCO: Common objects in context. In: ECCV (2014)
[22] Ma, W., Chen, H., Zhang, G., Chou, Y.C., Chen, J., de Melo, C., Yuille, A.: 3DSRBench: A comprehensive 3D spatial reasoning benchmark. In: ICCV (2025)
[23] Ma, W., Chou, Y.C., Liu, Q., Wang, X., de Melo, C., Xie, J., Yuille, A.: Spatial-Reasoner: Towards explicit and generalizable 3D spatial reasoning. arXiv:2504.20024 (2025)
[24] Ma, W., Ye, L., de Melo, C.M., Yuille, A., Chen, J.: SpatialLLM: A compound 3D-informed design towards spatially-intelligent large multimodal models. In: CVPR (2025)
[25] Marsili, D., Agrawal, R., Yue, Y., Gkioxari, G.: Visual agentic AI for spatial reasoning with a dynamic API. In: CVPR, pp. 19446–19455 (2025)
[26] OpenAI: Introducing GPT-5.2. https://openai.com/index/introducing-gpt-5-2/ (2025), accessed 2026-03-05
[27] Oquab, M., Darcet, T., Moutakanni, T., Vo, H., Szafraniec, M., Khalidov, V., Fernandez, P., Haziza, D., Massa, F., El-Nouby, A., et al.: DINOv2: Learning robust visual features without supervision (2023)
[28] Ravi, N., Gabeur, V., Hu, Y.T., Hu, R., Ryali, C., Ma, T., Khedr, H., Rädle, R., Rolland, C., Gustafson, L., et al.: SAM 2: Segment anything in images and videos. arXiv:2408.00714 (2024)
[29] Shazeer, N., Mirhoseini, A., Maziarz, K., Davis, A., Le, Q., Hinton, G., Dean, J.: Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv (2017)
[30] Stogiannidis, I., McDonagh, S., Tsaftaris, S.A.: Mind the gap: Diagnosing spatial reasoning failures in vision-language models. arXiv (2025)
[31] Sun, Y., Wang, X., Liu, Z., Miller, J., Efros, A., Hardt, M.: Test-time training with self-supervision for generalization under distribution shifts. In: ICML, pp. 9229–9248. PMLR (2020)
[32] Surís, D., Menon, S., Vondrick, C.: ViperGPT: Visual inference via Python execution for reasoning. In: ICCV, pp. 11854–11864 (2023)
[33] Tong, P., Brown, E., Wu, P., Woo, S., Iyer, A.J.V., Akula, S.C., Yang, S., Yang, J., Middepogu, M., Wang, Z., et al.: Cambrian-1: A fully open, vision-centric exploration of multimodal LLMs (2024)
[34] Wang, J., Wang, J., Athiwaratkun, B., Zhang, C., Zou, J.: Mixture-of-Agents enhances large language model capabilities. arXiv:2406.04692 (2024)
[35] Xue, Q., Liu, W., Wang, S., Wang, H., Wu, Y., Gao, W.: Reasoning path and latent state analysis for multi-view visual spatial reasoning: A cognitive science perspective. arXiv (2025)
[36] Yang, S., Li, Y., Lam, W., Cheng, Y.: Multi-LLM collaborative search for complex problem solving. arXiv:2502.18873 (2025)
[37] Yang, S., Xu, R., Xie, Y., Yang, S., Li, M., Lin, J., Zhu, C., Chen, X., Duan, H., Yue, X., et al.: MMSI-Bench: A benchmark for multi-image spatial intelligence (2025)
[38] Yuan, H., Li, X., Zhang, T., Sun, Y., Huang, Z., Xu, S., Ji, S., Tong, Y., Qi, L., Feng, J., et al.: Sa2VA: Marrying SAM2 with LLaVA for dense grounded understanding of images and videos. arXiv:2501.04001 (2025)
[39] Zhang, W., Zhou, Z., Zeng, X., Liu, X., Fang, J., Gao, C., Li, Y., Cui, J., Chen, X., Zhang, X.P.: Open3D-VQA: A benchmark for comprehensive spatial reasoning with multimodal large language model in open space (2025)
[40] Zhou, B., Zhao, H., Puig, X., Xiao, T., Fidler, S., Barriuso, A., Torralba, A.: Semantic understanding of scenes through the ADE20K dataset. International Journal of Computer Vision (2019)
[41] Zhou, H., Lee, G.H.: LLaVA-4D: Embedding spatiotemporal prompt into LMMs for 4D scene understanding. arXiv:2505.12253 (2025)