
arxiv: 2606.17480 · v1 · pith:LJB3IIZWnew · submitted 2026-06-16 · 💻 cs.CV · cs.RO

GeneralVLA-2: Geometry-Aware Reconstruction and Governed Memory for Robot Planning

Pith reviewed 2026-06-27 01:51 UTC · model grok-4.3

classification 💻 cs.CV cs.RO
keywords vision-language-action · 3D reconstruction · robot planning · multi-view geometry · memory management · object-centric planning · knowledge bank

The pith

Geometry-prior multi-view reconstruction and governed memory enable more reliable robot trajectory planning in vision-language-action systems.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that two bottlenecks limit generalist vision-language-action systems: monocular 3D reconstruction that hallucinates pose and unseen geometry, and memory that retrieves snippets without controls on quality or conflicts. It introduces GeoFuse-MV3D, which verifies geometry priors against calibrated multi-view masks and fuses only geometry while preserving appearance, and it upgrades KnowledgeBank with explicit quality, confidence, lifecycle, verifier, and conflict metadata together with precision-oriented retrieval. If the claim holds, these changes yield measurable gains in reconstruction fidelity on GSO-30 and in task success on Terminal-Bench 2.0 and SWE-Bench Verified, which the authors connect to more stable end-effector paths and better reuse of manipulation experience. A sympathetic reader would care because accurate object shapes and managed memory address core obstacles to deploying generalist robots that must act reliably from language and RGB-D input.

Core claim

We address the bottlenecks in GeneralVLA by adding GeoFuse-MV3D, a geometry-prior-guided MV-SAM3D branch that verifies external geometry cues with input-view masks, applies soft visual-hull support, performs axis-wise refinement, and fuses only geometry while preserving appearance, and by converting KnowledgeBank into a governed long-term memory with explicit quality, confidence, lifecycle, verifier, and conflict metadata together with precision-oriented retrieval. On GSO-30, GeoFuse-MV3D reduces CD and LPIPS by 2.20% and 2.02% while raising PSNR and SSIM by 2.36% and 1.03% over MV-SAM3D; on Terminal-Bench 2.0 and SWE-Bench Verified the governed memory raises SR by 4.53% and resolve rate by 3.73%.

What carries the argument

GeoFuse-MV3D, the geometry-prior-guided MV-SAM3D reconstruction branch that verifies external cues against input-view masks and fuses geometry only, together with the governed KnowledgeBank that stores explicit metadata for quality, confidence, lifecycle, verifier, and conflict and uses precision-oriented retrieval.
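
To make the memory half of this concrete, here is a minimal sketch of what precision-oriented retrieval over governed metadata could look like. The abstract only names the metadata categories (quality, confidence, lifecycle, verifier, conflict); the field names, thresholds, and ranking rule below are illustrative assumptions, not the paper's schema.

```python
from dataclasses import dataclass


@dataclass
class MemoryEntry:
    """Hypothetical governed-memory record; all fields are assumptions."""
    text: str
    similarity: float            # stand-in for semantic similarity to the query
    quality: float               # 0..1 quality score
    confidence: float            # 0..1 confidence score
    lifecycle: str               # e.g. "active" or "deprecated"
    verified: bool               # True once a verifier has accepted the entry
    conflicts_with: tuple = ()   # ids of entries this one contradicts


def retrieve(entries, min_quality=0.7, min_confidence=0.7,
             min_similarity=0.5, k=3):
    """Return at most k entries that pass every gate, best first.

    Precision-oriented here means an entry is dropped if it fails any
    single check, rather than being ranked lower and still returned.
    """
    eligible = [
        e for e in entries
        if e.lifecycle == "active"
        and e.verified
        and not e.conflicts_with
        and e.quality >= min_quality
        and e.confidence >= min_confidence
        and e.similarity >= min_similarity
    ]
    # Rank survivors by similarity weighted by confidence (an assumed rule).
    eligible.sort(key=lambda e: e.similarity * e.confidence, reverse=True)
    return eligible[:k]
```

The design choice worth noticing is that the gates run before ranking, so a highly similar but deprecated or conflicting snippet never reaches the planner. A plain append-and-retrieve bank would instead return it whenever its similarity score was high enough.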

If this is right

  • GeoFuse-MV3D produces lower Chamfer Distance and LPIPS and higher PSNR and SSIM than the MV-SAM3D baseline on GSO-30.
  • The governed KnowledgeBank raises success rate (SR) on Terminal-Bench 2.0 by 4.53% and resolve rate on SWE-Bench Verified by 3.73% over ReasoningBank, while lowering AS by 4.95% and 5.65%, respectively.
  • Stable object shapes from multi-view fusion support more reliable end-effector paths for manipulation.
  • Metadata controls on memory reduce conflicts and improve geometric relevance of retrieved snippets during planning.
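
For readers less familiar with the reconstruction metrics in the first bullet, the sketch below shows one common way to compute Chamfer Distance and PSNR, plus a helper for relative change. Conventions differ between papers (squared versus unsquared distances, sums versus means), and the abstract does not say which variant is used or whether its percentages are relative changes, so every choice here is an assumption.

```python
import math


def chamfer_distance(a, b):
    """Symmetric Chamfer distance between two lists of (x, y, z) points.

    Assumed convention: mean squared nearest-neighbour distance,
    summed over both directions. Brute force, O(len(a) * len(b)).
    """
    def one_way(src, dst):
        return sum(min(math.dist(p, q) ** 2 for q in dst) for p in src) / len(src)

    return one_way(a, b) + one_way(b, a)


def psnr(img_a, img_b, max_value=1.0):
    """PSNR in dB for two equally long flat lists of pixel intensities."""
    mse = sum((x - y) ** 2 for x, y in zip(img_a, img_b)) / len(img_a)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


def relative_change_percent(baseline, new):
    """Relative change of `new` with respect to `baseline`, in percent."""
    return (new - baseline) / baseline * 100.0
```

Lower is better for Chamfer Distance and LPIPS, higher is better for PSNR and SSIM, so a negative relative change is an improvement for the first pair and a positive one for the second.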

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same verification approach could apply to other multi-view reconstruction pipelines where priors must be checked against observations.
  • Governed memory with lifecycle and conflict metadata may scale better than simple append-and-retrieve systems for long-horizon robot tasks.
  • Tighter integration of the reconstruction output directly into the memory verifier could further reduce downstream planning errors.

Load-bearing premise

Calibrated multi-view observations are available and let the geometry priors verify and refine shapes without adding new hallucinations or biases from the priors.
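
The premise can be illustrated with a small sketch of mask-based verification. It scores a candidate 3D point by the fraction of calibrated views whose object mask contains the point's projection, which is one plausible reading of "soft visual-hull support". The paper's actual formulation is not available in the abstract, so the function names, the 3x4 camera-matrix layout, and the scoring rule are all assumptions.

```python
def project(P, point):
    """Project a 3D point with a 3x4 camera matrix (list of 3 rows).

    Returns pixel coordinates (u, v), or None if the point lies
    behind the camera.
    """
    x, y, z = point
    h = [row[0] * x + row[1] * y + row[2] * z + row[3] for row in P]
    if h[2] <= 0:
        return None
    return h[0] / h[2], h[1] / h[2]


def soft_hull_support(point, cameras, masks):
    """Fraction of views whose binary mask contains the projected point.

    A hard visual hull keeps a point only if this value is 1.0; a soft
    variant can keep points above a lower threshold so that one noisy
    mask does not carve away real geometry.
    """
    hits = 0
    for P, mask in zip(cameras, masks):
        uv = project(P, point)
        if uv is None:
            continue
        u, v = int(round(uv[0])), int(round(uv[1]))
        # Masks are indexed as mask[row][column], i.e. mask[v][u].
        if 0 <= v < len(mask) and 0 <= u < len(mask[0]) and mask[v][u]:
            hits += 1
    return hits / len(cameras)
```

A geometry prior could then be accepted only where its surface points have high support, which is where the premise bites: without calibrated cameras the projections are meaningless, and the check cannot catch a hallucinated shape.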

What would settle it

A controlled test on scenes with known ground-truth shapes that supplies calibrated multi-view inputs yet shows no reduction in reconstruction error or an increase in new artifacts would falsify the benefit of GeoFuse-MV3D.

Figures

Figures reproduced from arXiv: 2606.17480 by Boxin Shi, Guoqing Ma, Hao Tang, Haoyu Wang, Yandong Guo, Zeyu Zhang.

Figure 1: Overview of GeneralVLA-2 for robot manipulation. When calibrated multi-view ob…
Figure 2: GeoFuse-MV3D reconstruction branch in GeneralVLA-2. The branch keeps the same…
Figure 3: Architecture of the governed KnowledgeBank module used by GeneralVLA-2. The mod…
Figure 4: Real-world training-free demonstrations for four manipulation tasks.
Figure 5: Additional qualitative comparison between MV-SAM3D and GeoFuse-MV3D on the first…
Figure 6: Additional qualitative comparison between MV-SAM3D and GeoFuse-MV3D on the sec…
Figure 7: Simulation visualization examples from three representative manipulation rollouts.
Figure 8: Additional real-world robot execution visualizations from three recorded trials. Frames in…
Original abstract

Generalist vision-language-action systems need object-centric 3D evidence and reusable manipulation experience to plan reliable robot trajectories. GeneralVLA provides a hierarchical interface for converting language and RGB-D observations into 3D end-effector paths, but two bottlenecks remain. First, monocular SAM3D-style object reconstruction can hallucinate pose and unseen geometry, while manipulation benefits from stable object shape when calibrated multi-view observations are available. Second, the original KnowledgeBank mainly retrieves semantically similar snippets and appends new knowledge, which makes it difficult to control memory quality, conflicts, confidence, and geometric relevance. To address the first challenge, we introduce GeoFuse-MV3D, a geometry-prior-guided MV-SAM3D reconstruction branch that verifies external geometry cues with input-view masks, applies soft visual-hull support, performs axis-wise refinement, and fuses only geometry while preserving appearance. To address the second challenge, we upgrade KnowledgeBank into a governed long-term memory system with explicit quality, confidence, lifecycle, verifier, and conflict metadata, together with precision-oriented retrieval. Finally, we evaluate the reconstruction branch on GSO-30 and the memory module on Terminal-Bench 2.0 and SWE-Bench Verified; GeoFuse-MV3D improves over the MV-SAM3D baseline by reducing CD and LPIPS by 2.20% and 2.02% while increasing PSNR and SSIM by 2.36% and 1.03%, and KnowledgeBank improves over ReasoningBank by 4.53% on Terminal-Bench SR and 3.73% on SWE-Bench resolve rate, while reducing AS by 4.95% and 5.65%, respectively. Code: https://github.com/AIGeeksGroup/GeneralVLA-2. Website: https://aigeeksgroup.github.io/GeneralVLA-2.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces GeneralVLA-2, extending prior GeneralVLA work for vision-language-action robot planning. It proposes GeoFuse-MV3D, a geometry-prior-guided multi-view reconstruction branch that verifies cues via input-view masks, soft visual-hull, and axis-wise refinement before fusing geometry while preserving appearance. It also upgrades KnowledgeBank to a governed memory system with explicit quality, confidence, lifecycle, verifier, and conflict metadata plus precision-oriented retrieval. Evaluations report that GeoFuse-MV3D reduces CD and LPIPS by 2.20% and 2.02% while increasing PSNR and SSIM by 2.36% and 1.03% over MV-SAM3D on GSO-30, and the governed memory improves Terminal-Bench SR by 4.53% and SWE-Bench resolve rate by 3.73% while reducing AS by 4.95% and 5.65% over ReasoningBank.

Significance. If the modest metric gains can be shown to arise specifically from geometry-prior verification without introducing new biases or hallucinations, the work would provide a practical hierarchical interface for stable 3D object evidence and controlled long-term memory in manipulation planning. The explicit metadata and conflict handling in the memory module address a clear limitation of semantic-retrieval banks. However, the small effect sizes and absence of isolating controls make the practical significance uncertain without further validation.

major comments (2)
  1. [Abstract] The headline claim that calibrated multi-view observations plus geometry priors refine shapes without new hallucinations or biases rests on aggregate metric deltas of 1.03% to 2.36% versus MV-SAM3D on GSO-30, yet no ablation isolates prior quality, no conflict cases are reported, and no variance or statistical significance is given; the gains may therefore derive from fusion mechanics alone.
  2. [Abstract] The description of verification via input-view masks, soft visual-hull support, and axis-wise refinement does not say how prior quality is assessed or how conflicts between priors and observations are resolved, so the risk that priors silently degrade unseen geometry remains untested.
minor comments (1)
  1. [Abstract] The abstract states evaluation on Terminal-Bench 2.0 and SWE-Bench Verified but provides no details on task selection, number of runs, or baseline implementation; these should be added for reproducibility.
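
The request for variance and significance in the first major comment could be met with something as simple as a paired bootstrap over per-object scores. The sketch below is one such check under stated assumptions: per-object metric values for both methods are available as two aligned lists, and a percentile interval on the mean difference is an acceptable summary. The function name and defaults are hypothetical and nothing here comes from the paper.

```python
import random


def paired_bootstrap_ci(baseline, method, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean paired difference.

    `baseline` and `method` hold one score per test object, in the same
    order. Returns (mean_diff, lower, upper) for method - baseline.
    """
    if len(baseline) != len(method):
        raise ValueError("scores must be paired, one per object")
    diffs = [m - b for b, m in zip(baseline, method)]
    n = len(diffs)
    rng = random.Random(seed)  # fixed seed for reproducible intervals
    means = []
    for _ in range(n_resamples):
        # Resample objects with replacement, keeping pairs intact.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(diffs) / n, lower, upper
```

If the interval for a lower-is-better metric such as Chamfer Distance lies entirely below zero, the aggregate gain is unlikely to be resampling noise. On a 30-object benchmark with deltas near 2%, an interval that straddles zero would support the referee's concern.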

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the two major comments on the abstract below, agreeing that the claims require clarification given the modest aggregate gains and lack of isolating experiments.

  1. Referee: [Abstract] The headline claim that calibrated multi-view observations plus geometry priors refine shapes without new hallucinations or biases rests on aggregate metric deltas of 1.03% to 2.36% versus MV-SAM3D on GSO-30, yet no ablation isolates prior quality, no conflict cases are reported, and no variance or statistical significance is given; the gains may therefore derive from fusion mechanics alone.

    Authors: We agree the abstract phrasing is too strong. The reported deltas are aggregate metrics on GSO-30 with no ablations separating geometry-prior verification from fusion mechanics, no conflict cases, and no variance or significance tests. We will revise the abstract to describe the observed metric changes without asserting isolation of effects or absence of biases/hallucinations, and add a limitations paragraph noting these gaps. revision: yes

  2. Referee: [Abstract] The description of verification via input-view masks, soft visual-hull support, and axis-wise refinement does not say how prior quality is assessed or how conflicts between priors and observations are resolved, so the risk that priors silently degrade unseen geometry remains untested.

    Authors: The method section outlines the verification steps, but we concur that explicit mechanisms for assessing prior quality, resolving prior-observation conflicts, and testing for degradation of unseen geometry are not detailed or evaluated. We will expand the relevant method description and add a note on this untested risk as a limitation in the revision. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical results on external benchmarks


The paper introduces GeoFuse-MV3D reconstruction and a governed KnowledgeBank memory system, then reports concrete metric improvements (CD, LPIPS, PSNR, SSIM on GSO-30; SR, resolve rate, AS on Terminal-Bench and SWE-Bench) versus named baselines. These are presented as measured outcomes on independent test sets with no equations, first-principles derivations, fitted-parameter predictions, or self-citation chains that reduce the claimed gains to the inputs by construction. The methods are described procedurally (masks, visual-hull, axis-wise refinement, metadata fields) without any step that equates a result to its own definition or fit.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no access to full methods, equations, or assumptions in the manuscript.

pith-pipeline@v0.9.1-grok · 5896 in / 1087 out tokens · 43856 ms · 2026-06-27T01:51:21.247225+00:00 · methodology

