arxiv: 2606.31329 · v2 · pith:PZ4QPSHBnew · submitted 2026-06-30 · 💻 cs.RO · cs.AI

3D HAMSTER: Bridging Planning and Control in Hierarchical Vision Language Action Models through 3D Trajectory Guidance

Pith reviewed 2026-07-02 19:13 UTC · model grok-4.3

classification 💻 cs.RO cs.AI
keywords hierarchical VLA · 3D trajectory prediction · robot manipulation · depth reconstruction · vision-language-action models · point cloud policy · generalization under shifts

The pith

Augmenting a VLM with a depth encoder and reconstruction objective produces metrically accurate 3D trajectories that plug directly into point-cloud controllers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Current hierarchical vision-language-action systems generate 2D end-effector trajectories from VLMs but feed them to low-level policies that operate on 3D point clouds, forcing each waypoint onto whatever surface depth lies beneath and creating geometric distortion. 3D HAMSTER adds a dedicated depth encoder to the VLM along with a dense depth reconstruction training objective so the planner directly outputs 3D waypoint sequences. These sequences integrate into existing point-cloud policies without extra calibration steps. Experiments across trajectory prediction, simulation, and real-robot manipulation show consistent gains over both proprietary VLMs and 2D-guided baselines, with the largest improvements when object appearance changes or when language, spatial, and visual conditions are novel.

Core claim

The central claim is that a VLM augmented with a dedicated depth encoder and trained under a dense depth reconstruction objective can predict metrically reliable 3D waypoint sequences. These sequences integrate directly into a point-cloud-based low-level policy, eliminating the distortion that occurs when each 2D waypoint is assigned the depth of whatever scene surface lies beneath it. The resulting hierarchical system outperforms 2D-guided baselines and proprietary VLMs across prediction accuracy, simulated tasks, and real-world manipulation, with the largest margins under appearance-altering shifts and unseen language, spatial, and visual conditions.
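
The distortion described above can be made concrete with a pinhole camera model. The sketch below is illustrative only: the intrinsics, the waypoint, and the 0.90 m surface depth are made-up values, and `Intrinsics`, `project`, and `backproject` are hypothetical helpers, not code from the paper. It projects a free-space 3D waypoint to a pixel, then lifts that pixel back to 3D using the depth of the surface behind it, as a 2D-guided pipeline would have to.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Intrinsics:
    """Pinhole camera intrinsics in pixels (hypothetical values)."""
    fx: float
    fy: float
    cx: float
    cy: float


def project(point, k):
    """Project a camera-frame 3D point (x, y, z) in metres to a pixel (u, v)."""
    x, y, z = point
    return (k.fx * x / z + k.cx, k.fy * y / z + k.cy)


def backproject(pixel, depth, k):
    """Lift a pixel (u, v) to a camera-frame 3D point at the given depth in metres."""
    u, v = pixel
    return ((u - k.cx) * depth / k.fx, (v - k.cy) * depth / k.fy, depth)


def distance(a, b):
    """Euclidean distance between two 3D points."""
    return sum((p - q) ** 2 for p, q in zip(a, b)) ** 0.5


k = Intrinsics(fx=600.0, fy=600.0, cx=320.0, cy=240.0)

# Intended waypoint: gripper hovering in free space, 0.60 m from the camera.
true_waypoint = (0.10, -0.05, 0.60)
pixel = project(true_waypoint, k)

# 2D guidance keeps only the pixel. The surface seen at that pixel
# (assumed here to be a table top) lies 0.90 m away along the same ray.
surface_depth = 0.90
lifted_waypoint = backproject(pixel, surface_depth, k)

print("pixel:", pixel)
print("lifted waypoint:", lifted_waypoint)
print("3D error (m):", distance(true_waypoint, lifted_waypoint))
```

Both points reproject to the same pixel, so the error is invisible in image space even though the lifted waypoint sits roughly 0.3 m from the intended one in this toy setup.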

What carries the argument

The depth encoder plus dense depth reconstruction objective added to the VLM planner, which generates 3D waypoint sequences for direct use by point-cloud policies.
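
For readers unfamiliar with dense depth objectives, here is a minimal sketch of one common form: a masked L1 loss averaged over pixels with a valid depth reading. This is an assumption for illustration; the abstract does not specify the exact loss the authors use, and `masked_l1_depth_loss` is a hypothetical name.

```python
def masked_l1_depth_loss(pred, target):
    """Mean absolute depth error over pixels with a valid target.

    pred, target: 2D lists of depths in metres.
    A target value <= 0 marks a missing sensor reading and is skipped.
    Returns 0.0 when no pixel is valid.
    """
    total, count = 0.0, 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if t > 0:
                total += abs(p - t)
                count += 1
    return total / count if count else 0.0


# Toy 2x2 depth maps; the bottom-left target pixel is missing (0.0).
pred = [[0.5, 1.0], [2.0, 3.0]]
target = [[0.5, 1.2], [0.0, 2.5]]
loss = masked_l1_depth_loss(pred, target)
print("masked L1 depth loss (m):", loss)
```

The "dense" part is that every valid pixel contributes to the loss, rather than only the pixels along the trajectory.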

If this is right

  • 3D trajectories eliminate the need to assign surface depths to 2D waypoints, removing a source of geometric distortion in point-cloud policies.
  • The same hierarchical structure yields higher success rates in both simulation and real-robot manipulation.
  • Performance gains are largest precisely when appearance, language, spatial layout, or visual conditions differ from training.
  • No additional calibration or new low-level policy components are required for the integration.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same depth-augmented planning module could be tested on navigation or mobile manipulation tasks that already use point-cloud controllers.
  • Measuring how reconstruction loss weight affects metric accuracy in cluttered scenes would quantify the sensitivity of the 3D output.
  • If the depth encoder can be frozen after pre-training, the approach may scale to larger VLMs with modest extra compute.

Load-bearing premise

Adding a depth encoder and dense depth reconstruction objective to the VLM produces metrically reliable 3D waypoints that integrate into point-cloud policies without new geometric errors or extra calibration.

What would settle it

A controlled ablation that removes the depth encoder and reconstruction objective from the same VLM backbone and measures whether 3D trajectory error and downstream task success rates drop to match the 2D-guided baseline levels.
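
The "3D trajectory error" in such an ablation could be measured in several ways. One simple candidate, sketched below, is the mean Euclidean distance between corresponding predicted and ground-truth waypoints. The function name and the equal-length assumption are illustrative choices, not the paper's stated metric.

```python
import math


def mean_waypoint_error(pred, target):
    """Mean Euclidean distance (metres) between corresponding 3D waypoints.

    Assumes both trajectories have the same number of waypoints.
    """
    if len(pred) != len(target) or not pred:
        raise ValueError("trajectories must be non-empty and of equal length")
    return sum(math.dist(p, t) for p, t in zip(pred, target)) / len(pred)


# Toy two-waypoint trajectories: the second prediction is 0.3 m too shallow.
pred_traj = [(0.0, 0.0, 0.5), (0.1, 0.0, 0.5)]
true_traj = [(0.0, 0.0, 0.5), (0.1, 0.0, 0.8)]
error = mean_waypoint_error(pred_traj, true_traj)
print("mean waypoint error (m):", error)
```

Running the same metric on the full model and on the variant without the depth encoder and reconstruction objective would give the comparison this section asks for.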

Figures

Figures reproduced from arXiv: 2606.31329 by Byungkun Lee, Dongjin Kim, Dongyoon Hwang, Hoiyeong Jin, Hojoon Lee, Hyojin Jang, Hyunseung Kim, Jaegul Choo, Jueun Mun, Minho Park.

Figure 1: Comparison of 2D and 3D guidance in hierarchical VLAs.
Figure 2: Overview of 3D HAMSTER. The framework decouples semantic planning and motor execution with two-stage training strategy: Stage 1 aligns depth features with the VLM space using a dense reconstruction loss (L_depth) while preserving VLM capabilities; Stage 2 fine-tunes for trajectory prediction. The 3D trajectory planner fuses RGB and depth to generate metrically reliable 3D trajectories, which the trajectory-…
Figure 3: End-to-end manipulation evaluation setups. (a) Simulation environments from the Colosseum benchmark, showcasing various visual …
Figure 4: Qualitative comparison of 3D trajectory predictions. Each trajectory is shown from two viewpoints: baseline predictions that appear …
Original abstract

Hierarchical Vision-Language-Action (VLA) models decouple high-level planning from low-level control to improve generalization in robot manipulation. Recent work in this paradigm uses 2D end-effector trajectories predicted by a Vision-Language Model (VLM) as explicit guidance for a downstream policy. However, state-of-the-art low-level policies operate in 3D metric space on point clouds, and feeding them 2D guidance that lacks depth forces each waypoint to be assigned the depth of whatever scene surface lies beneath it, producing geometrically distorted trajectories. We propose 3D HAMSTER, a hierarchical framework that closes this gap by having the planner directly output metrically reliable 3D trajectories. We augment a VLM with a dedicated depth encoder and a dense depth reconstruction objective to predict 3D waypoint sequences, which are directly integrated into a point-cloud-based low-level policy. Across 3D trajectory prediction, simulation, and real-world manipulation, 3D HAMSTER consistently outperforms proprietary VLMs and 2D-guided baselines, with the largest gains under appearance-altering shifts and unseen language, spatial, and visual conditions. The project page is available at https://davian-robotics.github.io/3D_HAMSTER/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper proposes 3D HAMSTER, a hierarchical VLA framework that augments a VLM with a dedicated depth encoder and dense depth reconstruction objective so the planner directly outputs metrically reliable 3D waypoint sequences. These sequences are fed to an existing point-cloud low-level policy, avoiding the geometric distortion that arises when 2D trajectories are projected onto scene surfaces. Experiments across 3D trajectory prediction, simulation, and real-robot manipulation report consistent gains over proprietary VLMs and 2D-guided baselines, with the largest improvements under appearance shifts and unseen language/spatial/visual conditions.

Significance. If the metric reliability of the predicted 3D trajectories is rigorously established, the work would meaningfully close the 2D-to-3D gap in hierarchical VLA architectures and improve robustness under distribution shift. The explicit integration of depth reconstruction into the planner is a concrete architectural contribution that aligns high-level planning with the metric space used by modern low-level policies.

major comments (2)
  1. [Abstract] Abstract: the central claim that the depth encoder plus dense depth reconstruction objective produces 'metrically reliable 3D trajectories' that integrate directly into a point-cloud policy is load-bearing for all reported gains. Standard monocular depth objectives are scale-ambiguous; the manuscript provides no description of scale supervision, camera calibration, or any other mechanism that recovers absolute metric scale under varying intrinsics or appearance shifts. Without this, the geometric-distortion argument against 2D baselines cannot be substantiated.
  2. [Abstract] Abstract (and any methods section describing the depth objective): the paper does not report an ablation that isolates the contribution of metric scale versus relative depth or 2D trajectory shape. If the performance advantage disappears once scale is normalized or when the low-level policy is given relative rather than absolute 3D waypoints, the claimed benefit of 3D guidance would be undermined.
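
The scale-ambiguity concern in the first comment can be illustrated numerically. The sketch below is a toy example, not the paper's procedure: `fit_scale` is a hypothetical helper that finds the least-squares scale factor mapping a relative depth map onto metric ground truth. Two relative predictions that differ only by a constant factor are equally valid without a metric reference, and recovering metres requires exactly that reference.

```python
def fit_scale(relative, metric):
    """Least-squares scale s minimising sum((s * r - m)^2) over paired depths."""
    numerator = sum(r * m for r, m in zip(relative, metric))
    denominator = sum(r * r for r in relative)
    if denominator == 0:
        raise ValueError("relative depths are all zero")
    return numerator / denominator


# Ground-truth metric depths in metres (made-up values).
metric_depth = [0.5, 1.0, 2.0]

# Two scale-ambiguous predictions of the same scene: same shape, different units.
relative_a = [1.0, 2.0, 4.0]
relative_b = [3.0, 6.0, 12.0]

scale_a = fit_scale(relative_a, metric_depth)
scale_b = fit_scale(relative_b, metric_depth)

aligned_a = [scale_a * r for r in relative_a]
aligned_b = [scale_b * r for r in relative_b]

print("scales:", scale_a, scale_b)
print("aligned:", aligned_a, aligned_b)
```

After alignment both predictions match the metric depths, which is why the referee asks where the metric reference comes from in training and at test time.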

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful and constructive review. The comments correctly identify gaps in the description of metric scale handling. We address each point below and commit to revisions where the manuscript is incomplete.

  1. Referee: [Abstract] Abstract: the central claim that the depth encoder plus dense depth reconstruction objective produces 'metrically reliable 3D trajectories' that integrate directly into a point-cloud policy is load-bearing for all reported gains. Standard monocular depth objectives are scale-ambiguous; the manuscript provides no description of scale supervision, camera calibration, or any other mechanism that recovers absolute metric scale under varying intrinsics or appearance shifts. Without this, the geometric-distortion argument against 2D baselines cannot be substantiated.

    Authors: We agree the manuscript does not describe the scale supervision or calibration mechanism. The current text only states that a dense depth reconstruction objective is used. We will revise the methods section to explicitly detail how absolute metric scale is obtained (including training data sources and any calibration steps) so that the claim of metric reliability can be properly evaluated. revision: yes

  2. Referee: [Abstract] Abstract (and any methods section describing the depth objective): the paper does not report an ablation that isolates the contribution of metric scale versus relative depth or 2D trajectory shape. If the performance advantage disappears once scale is normalized or when the low-level policy is given relative rather than absolute 3D waypoints, the claimed benefit of 3D guidance would be undermined.

    Authors: We concur that the requested ablation is missing and would directly test whether metric scale drives the reported gains. We will add an ablation that compares absolute 3D waypoints against scale-normalized and relative versions, reporting results on the same simulation and real-robot tasks. revision: yes
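
One way the "relative" and "scale-normalized" variants in the promised ablation might be constructed is sketched below. The definitions are assumptions made for illustration (relative means offsets from the first waypoint; scale-normalized means those offsets divided by total path length), and the helper names are hypothetical.

```python
import math


def to_relative(traj):
    """Express each waypoint as an offset from the first waypoint."""
    x0, y0, z0 = traj[0]
    return [(x - x0, y - y0, z - z0) for x, y, z in traj]


def path_length(traj):
    """Sum of Euclidean distances between consecutive waypoints."""
    return sum(math.dist(a, b) for a, b in zip(traj, traj[1:]))


def scale_normalize(traj):
    """Relative trajectory rescaled to unit path length (metric scale discarded)."""
    rel = to_relative(traj)
    length = path_length(rel)
    if length == 0:
        raise ValueError("trajectory has zero length")
    return [tuple(c / length for c in point) for point in rel]


# A toy absolute trajectory in metres, and the same shape at twice the scale.
absolute = [(0.2, 0.0, 0.5), (0.2, 0.0, 0.7), (0.4, 0.0, 0.7)]
doubled = [tuple(2 * c for c in point) for point in absolute]

normalized = scale_normalize(absolute)
normalized_doubled = scale_normalize(doubled)
print("normalized:", normalized)
```

Because `absolute` and `doubled` normalize to the same waypoints, a policy fed the normalized variant sees trajectory shape but no metric scale, which is the condition the ablation needs to isolate.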

Circularity Check

0 steps flagged

No circularity: architectural proposal with empirical claims

The paper proposes an architectural augmentation (dedicated depth encoder plus dense depth reconstruction objective) to enable direct 3D waypoint output from a VLM, then integrates those waypoints into an existing point-cloud policy. No equations, derivations, or load-bearing steps are shown that reduce by construction to fitted inputs, self-citations, or renamed known results. Claims rest on reported outperformance across prediction, simulation, and real-world tasks rather than any self-referential loop. This is the common case of a self-contained empirical method paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, axioms, or invented entities. The depth encoder and reconstruction objective are introduced as part of the method but are not broken down into fitted values or background assumptions.

pith-pipeline@v0.9.1-grok · 5795 in / 1191 out tokens · 32093 ms · 2026-07-02T19:13:43.546582+00:00 · methodology
