Dense Reward for Multi-View 3D Reasoning with Global Maps and Local Views
Pith reviewed 2026-06-26 08:36 UTC · model grok-4.3
The pith
DR-MV3D supplies dense rewards from global map consistency and local view trajectories to supervise multi-view 3D visual question answering instead of sparse final-answer signals.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
DR-MV3D decomposes MV3D-VQA into (i) allocentric global map construction, (ii) question-conditioned view-trajectory planning, and (iii) egocentric grounding for answer prediction. It supplies a global consistency reward that aligns the predicted map with geometry-consistent pseudo targets from frozen 3D vision foundation models and a local trajectory reward that supervises ordered viewpoint selection, then optimizes the full pipeline with trajectory-level policy optimization to make intermediate steps learnable without manual annotations.
What carries the argument
DR-MV3D framework that decomposes MV3D-VQA into global map construction, view-trajectory planning, and egocentric grounding, then supplies a global consistency reward from frozen 3D model pseudo targets plus a local trajectory reward, optimized via GRPO.
If this is right
- The decomposition allows intermediate map construction and view planning steps to receive direct supervision without manual annotations.
- Global consistency and local trajectory rewards together produce more coherent cross-view reasoning than answer-level signals alone.
- Trajectory-level policy optimization enables end-to-end training of the full map-to-answer pipeline.
- The same dense-reward structure yields measurable gains across three distinct MV3D-VQA benchmarks.
Where Pith is reading between the lines
- The reliance on frozen foundation models for pseudo targets suggests the method could be applied to other spatial reasoning domains where reliable 3D geometry can be precomputed.
- If the pseudo-target generators contain scene-type biases, performance on out-of-distribution environments may degrade even when in-distribution benchmarks improve.
- Replacing the frozen models with jointly trained 3D components might remove one source of potential error propagation while preserving the dense-reward structure.
Load-bearing premise
Geometry-consistent pseudo targets generated by frozen 3D vision foundation models supply reliable supervision for the global consistency reward without introducing systematic errors that propagate to final answer accuracy.
What would settle it
An ablation that removes the dense rewards and trains only with answer-level supervision, resulting in performance on MindCube, VSI-Bench, and BLINK (MV) dropping to the level of the multi-image baselines, would falsify the effectiveness of process-level dense supervision.
Figures
read the original abstract
Multi-view 3D Visual Question Answering (MV3D-VQA) requires integrating partial observations into a coherent 3D scene representation and selecting informative viewpoints for multi-step spatial reasoning. However, current multimodal LLMs are typically trained with sparse, answer-level supervision, which often yields inconsistent cross-view reasoning and brittle view selection. We present DR-MV3D (Dense Reward for MV3D-VQA), a map-grounded learning framework that provides dense, verifiable rewards to supervise the reasoning process. Our approach decomposes MV3D-VQA into (i) allocentric global map construction, (ii) question-conditioned view-trajectory planning, and (iii) egocentric grounding for answer prediction. To make intermediate steps learnable without manual annotations, we introduce two rewards: a global consistency reward that aligns the predicted map with geometry-consistent pseudo targets from frozen 3D vision foundation models (e.g., VGGT + SAM3), and a local trajectory reward that supervises ordered viewpoint selection. We optimize the full pipeline with trajectory-level policy optimization (GRPO). Experiments on MindCube, VSI-Bench, and BLINK (MV) show that DR-MV3D consistently improves over strong multi-image baselines, supporting the effectiveness of process-level dense supervision for multi-view 3D reasoning.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes DR-MV3D, a map-grounded framework for multi-view 3D visual question answering (MV3D-VQA) that decomposes the task into allocentric global map construction, question-conditioned view-trajectory planning, and egocentric grounding. It supplies dense rewards via a global consistency term that aligns intermediate maps to pseudo targets produced by frozen models (VGGT + SAM3) together with a local trajectory reward, and optimizes the pipeline end-to-end with GRPO. Experiments on MindCube, VSI-Bench, and BLINK (MV) are reported to show consistent gains over strong multi-image baselines, which the authors attribute to the use of process-level dense supervision.
Significance. If the pseudo targets can be shown to supply accurate, unbiased geometry signals, the work would supply concrete evidence that dense, verifiable process rewards can improve cross-view consistency and view selection in multimodal LLMs beyond sparse answer-level training. The explicit decomposition into global-map and local-trajectory stages is a useful organizing principle. At present, however, the absence of any quantitative check on pseudo-label fidelity leaves open the possibility that reported gains arise from alignment with the frozen models rather than genuine 3D reasoning gains.
major comments (2)
- [Abstract] Abstract: the central claim that 'DR-MV3D consistently improves over strong multi-image baselines' and that this supports 'the effectiveness of process-level dense supervision' rests on unshown experimental controls; no numerical deltas, ablation tables, error bars, or statistical tests are supplied, rendering the headline result impossible to evaluate.
- [Abstract] Abstract and implied Experiments section: the global consistency reward is defined directly in terms of pseudo targets from VGGT + SAM3, yet no alignment error, consistency metric, or comparison against available 3D ground truth on MindCube/VSI-Bench subsets is reported. Without such a check, it is impossible to rule out that the optimization simply teaches the model to reproduce biases of the frozen foundation models rather than to acquire independent 3D reasoning.
minor comments (1)
- [Abstract] The acronym MV3D-VQA is introduced in the abstract without an explicit expansion on first use, although context makes the meaning recoverable.
Simulated Author's Rebuttal
We thank the referee for their thoughtful comments, which highlight important aspects of clarity in the abstract and validation of our pseudo-target approach. We will revise the manuscript to strengthen these areas.
read point-by-point responses
-
Referee: [Abstract] Abstract: the central claim that 'DR-MV3D consistently improves over strong multi-image baselines' and that this supports 'the effectiveness of process-level dense supervision' rests on unshown experimental controls; no numerical deltas, ablation tables, error bars, or statistical tests are supplied, rendering the headline result impossible to evaluate.
Authors: We agree that the abstract should provide more concrete evidence for the claims. Although the full manuscript contains detailed experimental results including performance tables, ablations, and comparisons in the Experiments section, the abstract itself does not include numerical deltas or references to them. We will revise the abstract to incorporate key quantitative results, such as accuracy improvements on MindCube, VSI-Bench, and BLINK (MV), along with mentions of ablation studies and statistical significance where applicable. This will allow readers to evaluate the headline results directly. revision: yes
-
Referee: [Abstract] Abstract and implied Experiments section: the global consistency reward is defined directly in terms of pseudo targets from VGGT + SAM3, yet no alignment error, consistency metric, or comparison against available 3D ground truth on MindCube/VSI-Bench subsets is reported. Without such a check, it is impossible to rule out that the optimization simply teaches the model to reproduce biases of the frozen foundation models rather than to acquire independent 3D reasoning.
Authors: We acknowledge this limitation in the current manuscript. The global consistency reward relies on pseudo targets from VGGT and SAM3 without an explicit fidelity check against ground truth. To address the referee's concern, we will add a quantitative analysis in the revised manuscript, including alignment error metrics and comparisons to 3D ground truth on available subsets of MindCube and VSI-Bench. This will help substantiate that the rewards encourage genuine 3D reasoning. We will also discuss potential biases and how the local trajectory reward and end-to-end optimization mitigate them. revision: yes
Circularity Check
No circularity: rewards and gains rest on external frozen models and benchmarks
full rationale
The derivation decomposes MV3D-VQA into map construction, trajectory planning, and grounding, then supplies rewards from geometry-consistent pseudo targets produced by independent frozen models (VGGT + SAM3) and optimizes via GRPO. These components are external to the present work and not defined in terms of the target predictions or fitted to the final answer accuracy. Gains are measured on separate held-out benchmarks (MindCube, VSI-Bench, BLINK) rather than by construction from the same signals. No self-citations, self-definitional equations, or fitted-input renamings appear in the provided derivation chain.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Frozen 3D vision foundation models (VGGT + SAM3) produce geometry-consistent pseudo targets suitable for supervising global map construction without manual annotation.
Reference graph
Works this paper leans on
-
[1]
In: Proceedings of the IEEE interna- tional conference on computer vision
Antol, S., Agrawal, A., Lu, J., Mitchell, M., Batra, D., Zitnick, C.L., Parikh, D.: Vqa: Visual question answering. In: Proceedings of the IEEE interna- tional conference on computer vision. pp. 2425–2433 (2015) 6
2015
-
[2]
arXiv preprint arXiv:2511.13719 (2025) 10, 11
Cai, Z., Wang, R., Gu, C., Pu, F., Xu, J., Wang, Y., Yin, W., Yang, Z., Wei, C., Sun, Q., et al.: Scaling spatial intelligence with multimodal foundation models. arXiv preprint arXiv:2511.13719 (2025) 10, 11
-
[3]
SAM 3: Segment Anything with Concepts
Carion, N., Gustafson, L., Hu, Y.T., Debnath, S., Hu, R., Suris, D., Ryali, C., Alwala, K.V., Khedr, H., Huang, A., et al.: Sam 3: Segment anything with concepts. arXiv preprint arXiv:2511.16719 (2025) 4, 7, 10, 5, 8
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[4]
arXiv preprint arXiv:2402.00782 (2024) 3, 8
Chan, A.J., Sun, H., Holt, S., Van Der Schaar, M.: Dense reward for free in reinforcement learning from human feedback. arXiv preprint arXiv:2402.00782 (2024) 3, 8
-
[5]
Chen, Z., Wang, W., Cao, Y., Liu, Y., Gao, Z., Cui, E., Zhu, J., Ye, S., Tian, H., Liu, Z., Gu, L., Wang, X., Li, Q., Ren, Y., Chen, Z., Luo, J., Wang, J., Jiang, T., Wang, B., He, C., Shi, B., Zhang, X., Lv, H., Wang, Y., Shao, W., Chu, P., Tu, Z., He, T., Wu, Z., Deng, H., Ge, J., Chen, K., Zhang, K., Wang, L., Dou, M., Lu, L., Zhu, X., Lu, T., Lin, D.,...
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[6]
Comanici, G., Bieber, E., Schaekermann, M., Pasupat, I., Sachdeva, N., Dhillon, I., Blistein, M., Ram, O., Zhang, D., Rosen, E., et al.: Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261 (2025) 1, 10, 11, 12
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[7]
Emerging Properties in Unified Multimodal Pretraining
Deng, C., Zhu, D., Li, K., Gou, C., Li, F., Wang, Z., Zhong, S., Yu, W., Nie, X., Song, Z., et al.: Emerging properties in unified multimodal pretraining. arXiv preprint arXiv:2505.14683 (2025) 10, 11
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[8]
arXiv preprint arXiv:2503.07065 (2025) 6
Deng, H., Zou, D., Ma, R., Luo, H., Cao, Y., Kang, Y.: Boosting the gen- eralization and reasoning of vision language models with curriculum rein- forcement learning. arXiv preprint arXiv:2503.07065 (2025) 6
-
[9]
arXiv e-prints pp
Deng, Y., Bansal, H., Yin, F., Peng, N., Wang, W., Chang, K.W.: Open- vlthinker: An early exploration to complex vision-language reasoning via iterative self-improvement. arXiv e-prints pp. arXiv–2503 (2025) 6
2025
-
[10]
PaLM-E: An Embodied Multimodal Language Model
Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) 2
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[11]
Fan, Z., Zhang, J., Li, R., Zhang, J., Chen, R., Hu, H., Wang, K., Qu, H., Wang, D., Yan, Z., Xu, H., Theiss, J., Chen, T., Li, J., Tu, Z., Wang, Z., Ranjan, R.: Vlm-3r: Vision-language models augmented with instruction- aligned 3d reconstruction (2025),https://arxiv.org/abs/2505.2027912 Dense Reward for Multi-View 3D Reasoning 17
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[12]
In: European Conference on Computer Vision
Fu, X., Hu, Y., Li, B., Feng, Y., Wang, H., Lin, X., Roth, D., Smith, N.A., Ma, W.C., Krishna, R.: Blink: Multimodal large language models can see but not perceive. In: European Conference on Computer Vision. pp. 148–
-
[13]
Springer (2024) 2, 5, 6, 9, 7
2024
-
[14]
In: Proceedings of the IEEE conference on computer vision and pattern recognition
Goyal, Y., Khot, T., Summers-Stay, D., Batra, D., Parikh, D.: Making the v in vqa matter: Elevating the role of image understanding in visual question answering. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 6904–6913 (2017) 6
2017
-
[15]
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
Guo, D., Yang, D., Zhang, H., Song, J., Wang, P., Zhu, Q., Xu, R., Zhang, R., Ma, S., Bi, X., et al.: Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948 (2025) 1, 5, 9
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[16]
arXiv preprint arXiv:2601.03590 (2026) 3, 5, 7, 8
Guo, Z., Yang, Z., Li, Y., Zhang, X., Gao, W., Wang, J., Li, C., Liu, X., Jian, P.: Can llms see without pixels? benchmarking spatial intelligence from textual descriptions. arXiv preprint arXiv:2601.03590 (2026) 3, 5, 7, 8
-
[17]
In: Proceedings of the IEEE conference on computer vision and pattern recognition
Gupta, S., Davidson, J., Levine, S., Sukthankar, R., Malik, J.: Cogni- tive mapping and planning for visual navigation. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 2616– 2625 (2017) 2
2017
-
[18]
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition
Hong, Y., Lin, C., Du, Y., Chen, Z., Tenenbaum, J.B., Gan, C.: 3d con- cept learning and reasoning from multi-view images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 9202–9212 (2023) 1, 2, 3, 5, 6
2023
-
[19]
arXiv preprint arXiv:2507.23478 (2025) 2, 5
Huang, T., Zhang, Z., Tang, H.: 3d-r1: Enhancing reasoning in 3d vlms for unified scene understanding. arXiv preprint arXiv:2507.23478 (2025) 2, 5
-
[20]
arXiv preprint arXiv:2506.01946 (2025)
Huang, X., Wu, J., Xie, Q., Han, K.: 3drs: Mllms need 3d-aware representa- tion supervision for scene understanding. arXiv preprint arXiv:2506.01946 (2025) 5
-
[21]
Hurst, A., Lerer, A., Goucher, A.P., Perelman, A., Ramesh, A., Clark, A., Ostrow, A., Welihinda, A., Hayes, A., Radford, A., et al.: Gpt-4o system card. arXiv preprint arXiv:2410.21276 (2024) 1, 10, 11, 12
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[22]
Ji, Y., Tan, H., Shi, J., Hao, X., Zhang, Y., Zhang, H., Wang, P., Zhao, M., Mu, Y., An, P., Xue, X., Su, Q., Lyu, H., Zheng, X., Liu, J., Wang, Z., Zhang, S.: Robobrain: A unified brain model for robotic manipulation from abstract to concrete (2025),https://arxiv.org/abs/2502.2125712
-
[23]
OpenVLA: An Open-Source Vision-Language-Action Model
Kim, M.J., Pertsch, K., Karamcheti, S., Xiao, T., Balakrishna, A., Nair, S., Rafailov, R., Foster, E., Lam, G., Sanketi, P., et al.: Openvla: An open-source vision-language-action model. arXiv preprint arXiv:2406.09246 (2024) 2
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[24]
In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition
Lai, X., Tian, Z., Chen, Y., Li, Y., Yuan, Y., Liu, S., Jia, J.: Lisa: Reasoning segmentation via large language model. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 9579–9589 (2024) 6
2024
-
[25]
Lee, P.Y., Je, J., Park, C., Uy, M.A., Guibas, L., Sung, M.: Perspective- aware reasoning in vision-language models via mental imagery simulation. 18 J. Choi and S. Lee In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 9241–9251 (2025) 5, 8
2025
-
[26]
In: Findings of the Association for Computational Linguistics: EMNLP 2025
Lee, S., Choi, J., Kang, I., Kim, J., Park, J., Shim, H.: 3d-aware vision- language models fine-tuning with geometric distillation. In: Findings of the Association for Computational Linguistics: EMNLP 2025. pp. 10628–10647 (2025) 5
2025
-
[27]
Li, B., Zhang, Y., Guo, D., Zhang, R., Li, F., Zhang, H., Zhang, K., Zhang, P., Li, Y., Liu, Z., Li, C.: Llava-onevision: Easy visual task transfer (2024), https://arxiv.org/abs/2408.0332612
work page internal anchor Pith review Pith/arXiv arXiv 2024
- [28]
-
[29]
Advances in Neural Information Processing Systems37, 140903–140936 (2024) 5
Linghu, X., Huang, J., Niu, X., Ma, X.S., Jia, B., Huang, S.: Multi-modal situated reasoning in 3d scenes. Advances in Neural Information Processing Systems37, 140903–140936 (2024) 5
2024
-
[30]
arXiv preprint arXiv:2510.16714 (2025) 5
Linghu, X., Huang, J., Zhu, Z., Jia, B., Huang, S.: Scenecot: Elicit- ing grounded chain-of-thought reasoning in 3d scenes. arXiv preprint arXiv:2510.16714 (2025) 5
-
[31]
Advances in neural information processing systems36, 34892–34916 (2023) 1
Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in neural information processing systems36, 34892–34916 (2023) 1
2023
-
[32]
arXiv preprint arXiv:2509.10884 (2025) 5
Liu, Q., Huang, T., Zhang, Z., Tang, H.: Nav-r1: Reasoning and navigation in embodied scenes. arXiv preprint arXiv:2509.10884 (2025) 5
-
[33]
Seg-Zero: Reasoning-Chain Guided Segmentation via Cognitive Reinforcement
Liu, Y., Peng, B., Zhong, Z., Yue, Z., Lu, F., Yu, B., Jia, J.: Seg-zero: Reasoning-chain guided segmentation via cognitive reinforcement. arXiv preprint arXiv:2503.06520 (2025) 6
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[34]
arXiv preprint arXiv:2505.12081 (2025) 6
Liu, Y., Qu, T., Zhong, Z., Peng, B., Liu, S., Yu, B., Jia, J.: Visionreasoner: Unified reasoning-integrated visual perception via reinforcement learning. arXiv preprint arXiv:2505.12081 (2025) 6
-
[35]
In: Proceedings of the IEEE/CVF International Conference on Computer Vision
Liu, Z., Sun, Z., Zang, Y., Dong, X., Cao, Y., Duan, H., Lin, D., Wang, J.: Visual-rft: Visual reinforcement fine-tuning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 2034–2044 (2025) 5
2034
-
[36]
LMM-R1: Empowering 3B LMMs with Strong Reasoning Abilities Through Two-Stage Rule-Based RL
Peng, Y., Zhang, G., Zhang, M., You, Z., Liu, J., Zhu, Q., Yang, K., Xu, X., Geng, X., Yang, X.: Lmm-r1: Empowering 3b lmms with strong reasoning abilities through two-stage rule-based rl. arXiv preprint arXiv:2503.07536 (2025) 6
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[37]
Qwen, :, Yang, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Li, C., Liu, D., Huang, F., Wei, H., Lin, H., Yang, J., Tu, J., Zhang, J., Yang, J., Yang, J., Zhou, J., Lin, J., Dang, K., Lu, K., Bao, K., Yang, K., Yu, L., Li, M., Xue, M., Zhang, P., Zhu, Q., Men, R., Lin, R., Li, T., Tang, T., Xia, T., Ren, X., Ren, X., Fan, Y., Su, Y., Zhang, Y., Wa...
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[38]
Seed, B., :, Chen, J., Fan, T., Liu, X., Liu, L., Lin, Z., Wang, M., Wang, C., Wei, X., Xu, W., Yuan, Y., Yue, Y., Yan, L., Yu, Q., Zuo, X., Zhang, C., Zhu, R., An, Z., Bai, Z., Bao, Y., Bin, X., Chen, J., Chen, F., Chen, Dense Reward for Multi-View 3D Reasoning 19 H., Chen, R., Chen, L., Chen, Z., Chen, J., Chen, S., Chen, K., Chen, Z., Chen, J., Chen, J...
-
[39]
DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
Shao, Z., Wang, P., Zhu, Q., Xu, R., Song, J., Bi, X., Zhang, H., Zhang, M., Li, Y., Wu, Y., et al.: Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300 (2024) 5, 8, 9
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[40]
HybridFlow: A Flexible and Efficient RLHF Framework
Sheng, G., Zhang, C., Ye, Z., Wu, X., Zhang, W., Zhang, R., Peng, Y., Lin, H., Wu, C.: Hybridflow: A flexible and efficient rlhf framework. arXiv preprint arXiv: 2409.19256 (2024) 4
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[41]
Singh, A., Fry, A., Perelman, A., Tart, A., Ganesh, A., El-Kishky, A., McLaughlin, A., Low, A., Ostrow, A., Ananthram, A., et al.: Openai gpt-5 system card. arXiv preprint arXiv:2601.03267 (2026) 10, 11
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[42]
arXiv preprint arXiv:2503.23829 (2025) 5 20 J
Su, Y., Yu, D., Song, L., Li, J., Mi, H., Tu, Z., Zhang, M., Yu, D.: Cross- ing the reward bridge: Expanding rl with verifiable rewards across diverse domains. arXiv preprint arXiv:2503.23829 (2025) 5 20 J. Choi and S. Lee
-
[43]
Team, V., Hong, W., Yu, W., Gu, X., Wang, G., Gan, G., Tang, H., Cheng, J., Qi, J., Ji, J., Pan, L., Duan, S., Wang, W., Wang, Y., Cheng, Y., He, Z., Su, Z., Yang, Z., Pan, Z., Zeng, A., Wang, B., Chen, B., Shi, B., Pang, C., Zhang, C., Yin, D., Yang, F., Chen, G., Li, H., Zhu, J., Chen, J., Xu, J., Xu, J., Chen, J., Lin, J., Chen, J., Wang, J., Chen, J.,...
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[44]
In: Proceedings of the Com- puter Vision and Pattern Recognition Conference
Wang, J., Chen, M., Karaev, N., Vedaldi, A., Rupprecht, C., Novotny, D.: Vggt: Visual geometry grounded transformer. In: Proceedings of the Com- puter Vision and Pattern Recognition Conference. pp. 5294–5306 (2025) 3, 4, 7, 10, 5, 8
2025
-
[45]
In: The Fourteenth International Conference on Learning Represen- tations (2026),https://openreview.net/forum?id=0FhrtdKLtD2, 3, 5, 6, 7, 9, 10, 11, 4
Wang, Q., Yin, B., Zhang, P., Zhang, J., Wang, K., Wang, Z., Zhang, J., Chandrasegaran, K., Liu, H., Krishna, R., Xie, S., Li, M., Wu, J., Fei-Fei, L.: Understanding VLMs spatial mental modeling capability from limited views. In: The Fourteenth International Conference on Learning Represen- tations (2026),https://openreview.net/forum?id=0FhrtdKLtD2, 3, 5,...
2026
-
[46]
In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition
Wang, S., Leroy, V., Cabon, Y., Chidlovskii, B., Revaud, J.: Dust3r: Geo- metric 3d vision made easy. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 20697–20709 (2024) 3
2024
-
[47]
InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency
Wang, W., Gao, Z., Gu, L., Pu, H., Cui, L., Wei, X., Liu, Z., Jing, L., Ye, S., Shao, J., et al.: Internvl3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency. arXiv preprint arXiv:2508.18265 (2025) 1
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[48]
In: Proceedings of the IEEE/CVF In- ternational conference on computer vision
Wang, Z., Li, X., Yang, J., Liu, Y., Jiang, S.: Gridmm: Grid memory map for vision-and-language navigation. In: Proceedings of the IEEE/CVF In- ternational conference on computer vision. pp. 15625–15636 (2023) 2
2023
-
[49]
Advances in neural information processing systems35, 24824–24837 (2022) 6
Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., Chi, E., Le, Q.V., Zhou, D., et al.: Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems35, 24824–24837 (2022) 6
2022
-
[50]
Wen, X., Liu, Z., Zheng, S., Xu, Z., Ye, S., Wu, Z., Liang, X., Wang, Y., Li, J., Miao, Z., et al.: Reinforcement learning with verifiable rewards implicitly incentivizes correct reasoning in base llms. arXiv preprint arXiv:2506.14245 (2025) 5
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[51]
Spatial-MLLM: Boosting MLLM Capabilities in Visual-based Spatial Intelligence
Wu, D., Liu, F., Hung, Y.H., Duan, Y.: Spatial-mllm: Boosting mllm capa- bilities in visual-based spatial intelligence. arXiv preprint arXiv:2505.23747 (2025) 10, 11, 12 Dense Reward for Multi-View 3D Reasoning 21
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[52]
Yang, A., Li, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Gao, C., Huang, C., Lv, C., Zheng, C., Liu, D., Zhou, F., Huang, F., Hu, F., Ge, H., Wei, H., Lin, H., Tang, J., Yang, J., Tu, J., Zhang, J., Yang, J., Yang, J., Zhou, J., Zhou, J., Lin, J., Dang, K., Bao, K., Yang, K., Yu, L., Deng, L., Li, M., Xue, M., Li, M., Zhang, P., Wang, P., Zhu, Q...
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[53]
In: Proceedings of the Computer Vision and Pattern Recognition Conference
Yang, J., Yang, S., Gupta, A.W., Han, R., Fei-Fei, L., Xie, S.: Thinking in space: How multimodal large language models see, remember, and recall spaces. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 10632–10643 (2025) 2, 5, 7, 9
2025
-
[54]
MMSI-Bench: A Benchmark for Multi-Image Spatial Intelligence
Yang, S., Xu, R., Xie, Y., Yang, S., Li, M., Lin, J., Zhu, C., Chen, X., Duan, H., Yue, X., et al.: Mmsi-bench: A benchmark for multi-image spatial intelligence. arXiv preprint arXiv:2505.23764 (2025) 5, 16
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[55]
In: Proceedings of the IEEE/CVF International Conference on Computer Vision
Yang, Y., He, X., Pan, H., Jiang, X., Deng, Y., Yang, X., Lu, H., Yin, D., Rao, F., Zhu, M., et al.: R1-onevision: Advancing generalized multi- modal reasoning through cross-modal formalization. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 2376–2385 (2025) 6
2025
-
[56]
In: The eleventh international conference on learning representations (2022) 6
Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K.R., Cao, Y.: React: Synergizing reasoning and acting in language models. In: The eleventh international conference on learning representations (2022) 6
2022
-
[57]
Embodied-R1: Reinforced Embodied Reasoning for General Robotic Manipulation
Yuan, Y., Cui, H., Huang, Y., Chen, Y., Ni, F., Dong, Z., Li, P., Zheng, Y., Hao, J.: Embodied-r1: Reinforced embodied reasoning for general robotic manipulation. arXiv preprint arXiv:2508.13998 (2025) 5
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[58]
arXiv preprint arXiv:2503.18013 (2025) 5
Zhan, Y., Zhu, Y., Zheng, S., Zhao, H., Yang, F., Tang, M., Wang, J.: Vision-r1: Evolving human-free alignment in large vision-language models via vision-guided reinforcement learning. arXiv preprint arXiv:2503.18013 (2025) 5
-
[59]
129373, 5, 8
Zhang, J., Huang, J., Yao, H., Liu, S., Zhang, X., Lu, S., Tao, D.: R1-vl: Learning to reason with multimodal large language models via step-wise group relative policy optimization (2025),https://arxiv.org/abs/2503. 129373, 5, 8
2025
-
[60]
arXiv preprint arXiv:2601.13029 (2026) 1, 2, 3, 5, 10, 11, 12
Zhang, Z., Wu, Y., Jia, L., Wang, Y., Zhang, Z., Li, Y., Ran, B., Zhang, F., Sun, Z., Yin, Z., et al.: Think3d: Thinking with space for spatial reasoning. arXiv preprint arXiv:2601.13029 (2026) 1, 2, 3, 5, 10, 11, 12
- [61]
-
[62]
InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models
Zhu, J., Wang, W., Chen, Z., Liu, Z., Ye, S., Gu, L., Tian, H., Duan, Y., Su, W., Shao, J., et al.: Internvl3: Exploring advanced training and test-time recipes for open-source multimodal models. arXiv preprint arXiv:2504.10479 (2025) 10, 11 22 J. Choi and S. Lee
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[63]
objects”: [ {“name
Zitkovich, B., Yu, T., Xu, S., Xu, P., Xiao, T., Xia, F., Wu, J., Wohlhart, P., Welker, S., Wahid, A., et al.: Rt-2: Vision-language-action models transfer web knowledge to robotic control. In: Conference on Robot Learning. pp. 2165–2183. PMLR (2023) 2 Dense Reward for Multi-View 3D Reasoning 1 Dense Reward for Multi-View 3D Reasoning with Global Maps and...
2023
-
[65]
Image X"> - [object_name]: ego_direction=
<think>: Complete a structured reasoning trajectory with 4 numbered steps: (1) Selecting a view: Read the question, identify the spatial anchor, locate it in the CogMap, and determine which viewpoint is most relevant, e.g., We should consider the viewpoint from <Image X>. (2) Egocentric layout: Convert the allocentric CogMap into an egocentric frame align...
-
[66]
X. option_text
<answer>: “X. option_text” [CogMap Instruction] <cogmap_gen_instruction> [egocentric Map Rules] –The viewpoint’s facing direction becomes “forward” in the EgoMap. –Objects in the opposite direction become “back”. –Apply 90◦ rotation for left/right accordingly. –For diagonal facings, decompose and rotate consistently. –Include only objects that are spatial...
-
[67]
Assign (x, y) coordinates to each object based on their estimated positions in the scene
<CogMap>: Construct an allocentric cognitive map as a 10×10 grid by observing the video frames. Assign (x, y) coordinates to each object based on their estimated positions in the scene
-
[68]
[object_name]
<think>: Complete a structured reasoning trajectory with 4 numbered steps: (1) Selecting a viewpoint: Identify the standing position and facing target in the CogMap from the question, and compute the facing direction. (2) Egocentric layout: Convert the allocentric CogMap into an egocentric frame aligned with the standing position’s facing direction. Outpu...
-
[69]
X. option_text
<answer>: “X. option_text”. [CogMap Instruction] <cogmap_gen_instruction> [Egocentric Map Rules] –The facing direction (from standing position toward the target object) becomes “forward” in the ego frame. –Compute displacement vectors from the standing position to each queried object. –Rotation rules: Grid [0,0]=top-left, [9,9]=bottom-right. •facing=up→fo...
-
[70]
<CogMap>: Construct an allocentric cognitive map from multi-view images
-
[71]
Image X"> - [object_name]: ego_direction=
<think>: Complete a structured reasoning trajectory with 4 numbered steps: (1) Selecting a view: Read the question, identify the spatial anchor, locate it in the CogMap, and determine which viewpoint is most relevant, e.g., We should consider the viewpoint from <Image X>. (2) Egocentric layout: Convert the allocentric CogMap into an egocentric frame align...
-
[72]
X. option_text
<answer>: “X. option_text” [CogMap Instruction] <cogmap_gen_instruction> [Egocentric Map Rules] –The viewpoint’s facing direction becomes “forward” in the EgoMap. –Objects in the opposite direction become “back”. –Apply 90◦ rotation for left/right accordingly. –For diagonal facings, decompose and rotate consistently. –Include only objects that are spatial...
-
[73]
The black chair with pillows is placed in front of a television
-
[74]
There are two yellow leather sofas in the room
-
[75]
left”. In the EgoMap, “TV
The black chair is positioned between the two sofas. From the perspective of someone standing at the location of the two single sofas and facing the same direction: - The black chair with pillows would be directly in front of the person. - To the left of the black chair, there would be one of the yellow leather sofas. Therefore, if you were positioned whe...
-
[76]
In par- ticular, adding egocentric guidance with the trajectory of viewpoint selection leads to the greatest improvement in the final score
The full setting achieves the highest plateau at about 62, which shows that spatial-map supervision and trajectory supervision are complementary. In par- ticular, adding egocentric guidance with the trajectory of viewpoint selection leads to the greatest improvement in the final score. For GRPO, the right plot in Figure A10 shows the mean reward over 200 ...
-
[77]
- Image 2 shows the same room but rotated 90 degrees clockwise from Image 1
Understanding the initial positions: - Image 1 shows a room with a white ceiling, a white wall, and a door. - Image 2 shows the same room but rotated 90 degrees clockwise from Image 1. - Image 3 shows the room again, but rotated 90 degrees clockwise from Image 2
-
[78]
- There is a window on the right side
Identifying the objects in Image 2: - There is a green cabinet on the left side. - There is a window on the right side. - There is a brown cabinet below the window. - There are several pictures hanging on the wall above the brown cabinet
-
[79]
- This means you would be facing the window and the brown cabinet
Determining the new orientation: - If you stand at the same spot and face the same direction as shown in Image 2, and then turn 90 degrees to the left, you would be facing the opposite direction of Image 2. - This means you would be facing the window and the brown cabinet
-
[80]
up". After turn left 90, the effective pose becomes[5,5] facing=
Identifying the object to your right: - When you face the window and the brown cabinet, the object to your right would be the green cabinet. Therefore, if you turn 90 degrees to the left from the position shown in Image 2, the object to your right would be the green cabinet. Answer:A </think> <answer>A. Green cabinet</answer> Ours:<think> (1) Selecting a ...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.