pith. machine review for the scientific record.

arxiv: 2604.10432 · v2 · submitted 2026-04-12 · 💻 cs.RO

Recognition: unknown

AnySlot: Goal-Conditioned Vision-Language-Action Policies for Zero-Shot Slot-Level Placement

Ci-Jyun Liang, Qinbo Zhang, Qi Su, Rongtao Xu, Sifan Zhou, Zhaofeng Hu

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 16:38 UTC · model grok-4.3

classification 💻 cs.RO
keywords AnySlot · goal-conditioned VLA · slot-level placement · zero-shot robotic manipulation · scene marker · SlotBench · vision-language-action policies · compositional instructions

The pith

AnySlot generates an explicit visual scene marker from language to let goal-conditioned VLA policies handle precise zero-shot slot placement.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tries to establish that inserting a generated visual scene marker as an intermediate goal between language instructions and control improves reliability for slot-level robotic placement tasks. This matters because monolithic VLA policies struggle with the combined demands of semantic grounding and sub-centimeter spatial accuracy under compositional language. By decoupling high-level slot selection from low-level execution, the approach aims to reduce error accumulation without task-specific fine-tuning. The authors also release SlotBench, a simulation benchmark with nine task categories, to measure progress on these precision demands. A sympathetic reader would care because such a split could make generalist robot policies more practical for real-world placement operations that current end-to-end methods cannot yet solve consistently.

Core claim

AnySlot reduces compositional complexity by turning language instructions into an explicit spatial visual goal via scene marker generation, then executing that goal with a goal-conditioned VLA policy. This hierarchical design decouples high-level slot selection from low-level execution to achieve both semantic accuracy and spatial robustness. Experiments demonstrate that the method significantly outperforms flat VLA baselines and previous modular grounding approaches in zero-shot slot-level placement tasks.

What carries the argument

Scene marker generation from language as an explicit visual goal, followed by a goal-conditioned VLA policy that drives the robot to match that marker.
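To make the split concrete, here is a minimal Python sketch of that two-stage flow. The function names and interfaces (`generate_marker`, `policy_step`, and so on) are hypothetical stand-ins for illustration, not the authors' API: a marker generator turns the instruction plus an observation into an explicit visual goal, and a goal-conditioned policy then tracks that goal at every control step.

```python
# Hedged sketch of the two-stage structure described above; names are
# hypothetical placeholders, not the paper's implementation.
from dataclasses import dataclass
from typing import Callable

import numpy as np


@dataclass
class VisualGoal:
    """Explicit spatial goal: the target slot rendered into the observation."""
    goal_image: np.ndarray   # observation with the marker composited in
    marker_xyz: np.ndarray   # assumed 3D target position, shape (3,)


def run_slot_placement(
    instruction: str,
    get_observation: Callable[[], np.ndarray],
    generate_marker: Callable[[str, np.ndarray], VisualGoal],
    policy_step: Callable[[np.ndarray, VisualGoal], np.ndarray],
    apply_action: Callable[[np.ndarray], None],
    max_steps: int = 200,
) -> VisualGoal:
    """High level: build the visual goal once; low level: track it."""
    obs = get_observation()
    goal = generate_marker(instruction, obs)   # semantic grounding happens here
    for _ in range(max_steps):
        obs = get_observation()
        action = policy_step(obs, goal)        # spatial control happens here
        apply_action(action)
    return goal
```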

If this is right

  • Compositional language instructions for placement become tractable by separating semantic grounding from spatial control.
  • Zero-shot performance on precision slot tasks rises without requiring task-specific training data.
  • Structured spatial reasoning benchmarks like SlotBench become necessary to evaluate future VLA methods.
  • Monolithic end-to-end VLA policies can be improved by adding an explicit visual goal layer rather than retraining from scratch.
  • Robotic manipulation under variable language gains robustness when high-level selection is isolated from low-level execution.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same marker-plus-goal pattern could extend to other fine-motor tasks such as peg insertion or part alignment where language must specify exact locations.
  • If marker generation proves reliable across real cameras, the method may reduce the need for full end-to-end language-to-action training in new environments.
  • SlotBench-style benchmarks could expose similar failure modes in other VLA domains that demand sub-centimeter accuracy.
  • Hierarchical visual goals might combine with existing object detectors to handle partially observable scenes without retraining the full policy.

Load-bearing premise

A reliable scene marker can always be generated from the language instruction and the goal-conditioned policy can reach the required sub-centimeter spatial accuracy without further fine-tuning or domain data.

What would settle it

In a held-out set of novel slot placement tasks with unseen language compositions, the generated markers are inaccurate or the policy repeatedly misses target slots by more than one centimeter in zero-shot execution.
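A hedged sketch of how that criterion could be checked, assuming access to final placement positions and ground-truth slot positions from held-out zero-shot rollouts; the data layout is an assumption, not the paper's evaluation code.

```python
# Count how often zero-shot placements miss the target slot by more than 1 cm.
import numpy as np

MISS_THRESHOLD_M = 0.01  # one centimeter, per the criterion above


def miss_rate(final_positions: np.ndarray, target_positions: np.ndarray) -> float:
    """Fraction of rollouts whose final placement error exceeds the threshold."""
    errors = np.linalg.norm(final_positions - target_positions, axis=-1)
    return float(np.mean(errors > MISS_THRESHOLD_M))
```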

Figures

Figures reproduced from arXiv: 2604.10432 by Ci-Jyun Liang, Qinbo Zhang, Qi Su, Rongtao Xu, Sifan Zhou, Zhaofeng Hu.

Figure 1
Figure 1: Overview of flat (a), modular (b), and (c) our goal-conditioned policy.
Figure 2
Figure 2: AnySlot overview. We formulate slot-level placement as goal-conditioned control. High-level goal construction uses the Nano-Banana image generator to render a blue-sphere goal from the language prompt, lifting it to a view-consistent multi-view overlay via depth and camera calibration. Low-level control uses a goal-conditioned VLA policy ($\pi_{0.5}$) with a PaliGemma-3B backbone and action expert, mappin…
Figure 3
Figure 3: SlotBench.
Figure 4
Figure 4: Comparison between AnySlot and a VLM-based method. AnySlot accurately grounds the target slot and executes successful placement, while the VLM-based method mislocalizes the target and fails.
Figure 5
Figure 5: Real-world goal reconstruction. A visual goal (blue sphere) is generated in the head view, lifted to 3D via depth, and projected into multiple views. The reconstructed goal aligns well with the target location and remains spatially consistent across views, demonstrating effective real-world goal construction.
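The goal-lifting step described in the Figure 2 and Figure 5 captions (back-project the generated marker pixel via depth and camera calibration, then reproject it into the other views) is standard pinhole geometry. A minimal sketch follows, using generic intrinsics/extrinsics conventions rather than the paper's implementation.

```python
# Generic pinhole-camera sketch of lifting a 2D goal pixel to 3D and
# reprojecting it into another calibrated view; not the authors' code.
import numpy as np


def backproject(u: float, v: float, depth: float,
                K: np.ndarray, T_world_cam: np.ndarray) -> np.ndarray:
    """Lift pixel (u, v) with metric depth into world coordinates."""
    ray_cam = np.linalg.inv(K) @ np.array([u, v, 1.0])
    p_cam = ray_cam * depth                          # point in the camera frame
    p_world = T_world_cam @ np.append(p_cam, 1.0)    # 4x4 camera-to-world pose
    return p_world[:3]


def project(p_world: np.ndarray,
            K: np.ndarray, T_world_cam: np.ndarray) -> np.ndarray:
    """Project a world point into a calibrated view; returns pixel (u, v)."""
    p_cam = np.linalg.inv(T_world_cam) @ np.append(p_world, 1.0)
    uvw = K @ p_cam[:3]
    return uvw[:2] / uvw[2]
```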
Original abstract

Vision-Language-Action (VLA) policies have emerged as a versatile paradigm for generalist robotic manipulation. However, precise object placement under compositional language instructions remains a major challenge for modern monolithic VLA policies. Slot-level tasks require both reliable slot grounding and sub-centimeter execution accuracy. To this end, we propose AnySlot, a framework that reduces compositional complexity by introducing an explicit spatial visual goal as an intermediate representation between language grounding and control. AnySlot turns language into an explicit visual goal by generating a scene marker, then executes this goal with a goal-conditioned VLA policy. This hierarchical design effectively decouples high-level slot selection from low-level execution, ensuring both semantic accuracy and spatial robustness. Furthermore, recognizing the lack of existing benchmarks for such precision-demanding tasks, we introduce SlotBench, a comprehensive simulation benchmark featuring nine task categories tailored to evaluate structured spatial reasoning in slot-level placement. Extensive experiments show that AnySlot significantly outperforms flat VLA baselines and previous modular grounding methods in zero-shot slot-level placement.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes AnySlot, a hierarchical goal-conditioned VLA framework that converts compositional language instructions into an explicit visual scene marker as an intermediate representation, which is then executed by a goal-conditioned policy for zero-shot slot-level placement. It introduces SlotBench, a simulation benchmark with nine task categories focused on structured spatial reasoning, and claims that this decoupling of slot selection from low-level control yields superior performance over flat VLA baselines and prior modular grounding methods.

Significance. If the empirical results hold under rigorous evaluation, the approach could meaningfully advance precise robotic manipulation by separating semantic grounding from spatial execution, addressing a key limitation of monolithic VLAs in tasks requiring sub-centimeter accuracy. The new SlotBench benchmark fills a gap for evaluating compositional spatial tasks and could serve as a standard for future work, provided the marker-generation step proves reliable across categories.

major comments (2)
  1. [Abstract] Abstract: The central claim that the hierarchical design 'ensures both semantic accuracy and spatial robustness' and 'significantly outperforms' baselines is load-bearing on the assumption that the scene marker generator produces spatially precise targets from compositional instructions. No quantitative breakdown of marker localization error, no ablation of marker quality versus end-to-end success rates, and no failure-mode analysis across the nine SlotBench categories are referenced, leaving open the possibility that reported gains are driven primarily by the upstream grounding module rather than the proposed architecture.
  2. [Abstract] The weakest assumption noted in the stress-test—that reliable scene marker generation is always possible and that the policy achieves sub-centimeter accuracy without fine-tuning—directly affects the zero-shot claim. The manuscript provides no evidence (e.g., marker error distributions or policy corrective range analysis) that the goal-conditioned policy can recover from typical VLM grounding inaccuracies on compositional cases, which is required to substantiate the decoupling benefit.
minor comments (2)
  1. [Abstract] The abstract refers to 'flat VLA baselines and previous modular grounding methods' without naming the specific methods or citing their original papers; adding these references would improve traceability.
  2. [Abstract] SlotBench is introduced as addressing the 'lack of existing benchmarks,' but the manuscript could briefly contrast its nine categories with related manipulation benchmarks (e.g., those focused on object rearrangement) to clarify novelty.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the insightful comments on our work. The feedback highlights important aspects of substantiating the benefits of our hierarchical design, and we will revise the manuscript accordingly to provide the requested quantitative analyses and ablations.

Point-by-point responses
  1. Referee: [Abstract] Abstract: The central claim that the hierarchical design 'ensures both semantic accuracy and spatial robustness' and 'significantly outperforms' baselines is load-bearing on the assumption that the scene marker generator produces spatially precise targets from compositional instructions. No quantitative breakdown of marker localization error, no ablation of marker quality versus end-to-end success rates, and no failure-mode analysis across the nine SlotBench categories are referenced, leaving open the possibility that reported gains are driven primarily by the upstream grounding module rather than the proposed architecture.

    Authors: We agree that additional analysis is needed to isolate the contributions of the marker generator and the goal-conditioned policy. In the revised version, we will add a quantitative breakdown of marker localization error (including mean error and distributions across the nine task categories in SlotBench). We will also include an ablation comparing end-to-end success rates using generated markers versus oracle (perfect) markers to demonstrate the policy's role. Finally, we will expand the results section with a per-category failure-mode analysis to show where the hierarchical decoupling provides gains beyond the upstream module alone. revision: yes

  2. Referee: [Abstract] The weakest assumption noted in the stress-test—that reliable scene marker generation is always possible and that the policy achieves sub-centimeter accuracy without fine-tuning—directly affects the zero-shot claim. The manuscript provides no evidence (e.g., marker error distributions or policy corrective range analysis) that the goal-conditioned policy can recover from typical VLM grounding inaccuracies on compositional cases, which is required to substantiate the decoupling benefit.

    Authors: We acknowledge that explicit evidence for the policy's robustness to grounding inaccuracies would strengthen the zero-shot claims. While our current experiments demonstrate overall performance advantages in zero-shot settings, we did not include a dedicated analysis of recovery from marker errors. In the revision, we will add marker error distributions from the generator and evaluate the goal-conditioned policy's corrective range by testing performance under controlled perturbations to the visual goals (simulating typical VLM inaccuracies on compositional instructions). This will directly address the decoupling benefit. revision: yes
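A sketch of what the promised corrective-range analysis could look like: sweep controlled offsets applied to the visual goal and record zero-shot success rate per offset radius. The rollout interface here is an assumed placeholder, not the authors' code.

```python
# Hedged sketch of a goal-perturbation sweep; interfaces are assumptions.
from typing import Callable, Dict, Sequence

import numpy as np


def corrective_range(
    tasks: Sequence[object],
    rollout_with_goal_offset: Callable[[object, np.ndarray], bool],
    radii_m: Sequence[float] = (0.0, 0.005, 0.01, 0.02, 0.04),
    trials_per_radius: int = 20,
    seed: int = 0,
) -> Dict[float, float]:
    """Success rate vs. goal-perturbation radius (metres), averaged over tasks."""
    rng = np.random.default_rng(seed)
    results: Dict[float, float] = {}
    for r in radii_m:
        successes, total = 0, 0
        for task in tasks:
            for _ in range(trials_per_radius):
                direction = rng.normal(size=3)
                direction /= np.linalg.norm(direction) + 1e-9
                offset = r * direction            # controlled 3D offset of radius r
                successes += bool(rollout_with_goal_offset(task, offset))
                total += 1
        results[r] = successes / total
    return results
```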

Circularity Check

0 steps flagged

No significant circularity in empirical framework

Full rationale

The paper is an empirical proposal of a hierarchical VLA framework (AnySlot) plus a new benchmark (SlotBench). It contains no equations, derivations, first-principles predictions, or parameter-fitting steps that could reduce outputs to inputs by construction. Claims rest on experimental comparisons rather than any self-referential definitions or imported uniqueness theorems. This is the normal case for robotics system papers; the derivation chain is absent, so no circularity patterns apply.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper is an empirical robotics framework with no mathematical derivations; no free parameters, axioms, or invented entities are identifiable from the provided abstract.

pith-pipeline@v0.9.0 · 5493 in / 1108 out tokens · 35368 ms · 2026-05-10T16:38:01.997535+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

43 extracted references · 22 canonical work pages · 9 internal anchors

  1. [1]

    PaliGemma: A versatile 3B VLM for transfer

    Beyer, L., Steiner, A., Pinto, A.S., Kolesnikov, A., Wang, X., Salz, D., Neumann, M., Alabdulmohsin, I., Tschannen, M., Bugliarello, E., et al.: Paligemma: A versatile 3b vlm for transfer. arXiv preprint arXiv:2407.07726 (2024)

  2. [2]

    $\pi_0$: A Vision-Language-Action Flow Model for General Robot Control

    Black, K., Brown, N., Driess, D., Esmail, A., Equi, M., Finn, C., Fusai, N., Groom, L., Hausman, K., Ichter, B., et al.: π0: A vision-language-action flow model for general robot control. arXiv preprint arXiv:2410.24164 (2024)

  3. [3]

    AgiBot World Colosseo: A Large-scale Manipulation Platform for Scalable and Intelligent Embodied Systems

    Bu, Q., Cai, J., Chen, L., Cui, X., Ding, Y., Feng, S., Gao, S., He, X., Hu, X., Huang, X., et al.: Agibot world colosseo: A large-scale manipulation platform for scalable and intelligent embodied systems. arXiv preprint arXiv:2503.06669 (2025)

  4. [4]

    RoboTwin 2.0: A Scalable Data Generator and Benchmark with Strong Domain Randomization for Robust Bimanual Robotic Manipulation

    Chen, T., Chen, Z., Chen, B., Cai, Z., Liu, Y., Liang, Q., Li, Z., Lin, X., Ge, Y., Gu, Z., et al.: Robotwin 2.0: A scalable data generator and benchmark with strong domain randomization for robust bimanual robotic manipulation. arXiv preprint arXiv:2506.18088 (2025)

  5. [5]

    Diffusion Policy: Visuomotor Policy Learning via Action Diffusion

    Chi, C., Xu, Z., Feng, S., Cousineau, E., Du, Y., Burchfiel, B., Tedrake, R., Song, S.: Diffusion policy: Visuomotor policy learning via action diffusion. The International Journal of Robotics Research 44(10-11), 1684–1704 (2025)

  6. [6]

    Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models

    Deitke, M., Clark, C., Lee, S., Tripathi, R., Yang, Y., Park, J.S., Salehi, M., Muennighoff, N., Lo, K., Soldaini, L., et al.: Molmo and pixmo: Open weights and open data for state-of-the-art vision-language models. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 91–104 (2025)

  7. [7]

    Deep SE(3)-Equivariant Geometric Reasoning for Precise Placement Tasks

    Eisner, B., Yang, Y., Davchev, T., Vecerik, M., Scholz, J., Held, D.: Deep SE(3)-equivariant geometric reasoning for precise placement tasks. arXiv preprint arXiv:2404.13478 (2024)

  8. [8]

    ThinkAct: Vision-Language-Action Reasoning via Reinforced Visual Latent Planning

    Huang, C.P., Wu, Y.H., Chen, M.H., Wang, Y.C.F., Yang, F.E.: Thinkact: Vision-language-action reasoning via reinforced visual latent planning. arXiv preprint arXiv:2507.16815 (2025)

  9. [9]

    RoboGround: Robotic Manipulation with Grounded Vision-Language Priors

    Huang, H., Chen, X., Chen, Y., Li, H., Han, X., Wang, Z., Wang, T., Pang, J., Zhao, Z.: Roboground: Robotic manipulation with grounded vision-language priors. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 22540–22550 (2025)

  10. [10]

    $\pi_{0.5}$: a Vision-Language-Action Model with Open-World Generalization

    Intelligence, P., Black, K., Brown, N., Darpinian, J., Dhabalia, K., Driess, D., Esmail, A., Equi, M., Finn, C., Fusai, N., et al.: π0.5: a vision-language-action model with open-world generalization. arXiv preprint arXiv:2504.16054 (2025)

  11. [11]

    Prismatic VLMs: Investigating the Design Space of Visually-Conditioned Language Models

    Karamcheti, S., Nair, S., Balakrishna, A., Liang, P., Kollar, T., Sadigh, D.: Prismatic vlms: Investigating the design space of visually-conditioned language models. In: Forty-first International Conference on Machine Learning (2024)

  12. [12]

    OpenVLA: An Open-Source Vision-Language-Action Model

    Kim, M., Pertsch, K., Karamcheti, S., Xiao, T., Balakrishna, A., Nair, S., Rafailov, R., Foster, E., Lam, G., Sanketi, P., Vuong, Q., Kollar, T., Burchfiel, B., Tedrake, R., Sadigh, D., Levine, S., Liang, P., Finn, C.: Openvla: An open-source vision-language-action model. arXiv preprint arXiv:2406.09246 (2024)

  13. [13]

    PointVLA: Injecting the 3D World into Vision-Language-Action Models

    Li, C., Wen, J., Peng, Y., Peng, Y., Zhu, Y.: Pointvla: Injecting the 3d world into vision-language-action models. IEEE Robotics and Automation Letters 11(3), 2506–2513 (2026)

  14. [14]

    Bridgevla: Input-output alignment for efficient 3d manipulation learning with vision-language models

    Li, P., Chen, Y., Wu, H., Ma, X., Wu, X., Huang, Y., Wang, L., Kong, T., Tan, T.: Bridgevla: Input-output alignment for efficient 3d manipulation learning with vision-language models. arXiv preprint arXiv:2506.07961 (2025)

  15. [15]

    Evaluating Real-World Robot Manipulation Policies in Simulation

    Li, X., Hsu, K., Gu, J., Pertsch, K., Mees, O., Walke, H.R., Fu, C., Lunawat, I., Sieh, I., Kirmani, S., et al.: Evaluating real-world robot manipulation policies in simulation. arXiv preprint arXiv:2405.05941 (2024)

  16. [16]

    SP-VLA: A Joint Model Scheduling and Token Pruning Approach for VLA Model Acceleration

    Li, Y., Meng, Y., Sun, Z., Ji, K., Tang, C., Fan, J., Ma, X., Xia, S., Wang, Z., Zhu, W.: Sp-vla: A joint model scheduling and token pruning approach for vla model acceleration. arXiv preprint arXiv:2506.12723 (2025)

  17. [17]

    HAMSTER: Hierarchical Action Models for Open-World Robot Manipulation

    Li, Y., Deng, Y., Zhang, J., Jang, J., Memmel, M., Yu, R., Garrett, C.R., Ramos, F., Fox, D., Li, A., et al.: Hamster: Hierarchical action models for open-world robot manipulation. arXiv preprint arXiv:2502.05485 (2025)

  18. [18]

    VIP: Vision Instructed Pre-training for Robotic Manipulation

    Li, Z., Ren, L., Yang, J., Zhao, Y., Wu, X., Xu, Z., Bai, X., Zhao, H.: Vip: Vision instructed pre-training for robotic manipulation. arXiv preprint arXiv:2410.07169 (2024)

  19. [19]

    PixelVLA: Advancing Pixel-Level Understanding in Vision-Language-Action Model

    Liang, W., Sun, G., He, Y., Dong, J., Dai, S., Laptev, I., Khan, S., Cong, Y.: Pixelvla: Advancing pixel-level understanding in vision-language-action model. arXiv preprint arXiv:2511.01571 (2025)

  20. [20]

    LIBERO: Benchmarking Knowledge Transfer for Lifelong Robot Learning

    Liu, B., Zhu, Y., Gao, C., Feng, Y., Liu, Q., Zhu, Y., Stone, P.: Libero: Benchmarking knowledge transfer for lifelong robot learning. Advances in Neural Information Processing Systems 36, 44776–44791 (2023)

  21. [21]

    CALVIN: A Benchmark for Language-Conditioned Policy Learning for Long-Horizon Robot Manipulation Tasks

    Mees, O., Hermann, L., Rosete-Beas, E., Burgard, W.: Calvin: A benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks. IEEE Robotics and Automation Letters 7(3), 7327–7334 (2022)

  22. [22]

    Task-Driven Perception and Manipulation for Constrained Placement of Unknown Objects

    Mitash, C., Shome, R., Wen, B., Boularias, A., Bekris, K.: Task-driven perception and manipulation for constrained placement of unknown objects. IEEE Robotics and Automation Letters 5(4), 5605–5612 (2020)

  23. [23]

    RoboTwin: Dual-Arm Robot Benchmark with Generative Digital Twins (Early Version)

    Mu, Y., Chen, T., Peng, S., Chen, Z., Gao, Z., Zou, Y., Lin, L., Xie, Z., Luo, P.: Robotwin: Dual-arm robot benchmark with generative digital twins (early version). In: European Conference on Computer Vision. pp. 264–273. Springer (2024)

  24. [24]

    Learning to Place Objects onto Flat Surfaces in Upright Orientations

    Newbury, R., He, K., Cosgun, A., Drummond, T.: Learning to place objects onto flat surfaces in upright orientations. IEEE Robotics and Automation Letters 6(3), 4377–4384 (2021)

  25. [25]

    Predicting Stable Configurations for Semantic Placement of Novel Objects

    Paxton, C., Xie, C., Hermans, T., Fox, D.: Predicting stable configurations for semantic placement of novel objects. In: Faust, A., Hsu, D., Neumann, G. (eds.) Conference on Robot Learning, 8-11 November 2021, London, UK. Proceedings of Machine Learning Research, vol. 164, pp. 806–815. PMLR (2021)

  26. [26]

    Learning Transferable Visual Models from Natural Language Supervision

    Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning. pp. 8748–8763. PMLR (2021)

  27. [27]

    Slot-Level Robotic Placement via Visual Imitation from Single Human Video

    Shan, D., Mo, K., Yang, W., Chao, Y.W., Fouhey, D., Fox, D., Mousavian, A.: Slot-level robotic placement via visual imitation from single human video. arXiv preprint arXiv:2504.01959 (2025)

  28. [28]

    Hi Robot: Open-Ended Instruction Following with Hierarchical Vision-Language-Action Models

    Shi, L.X., Ichter, B., Equi, M., Ke, L., Pertsch, K., Vuong, Q., Tanner, J., Walling, A., Wang, H., Fusai, N., et al.: Hi robot: Open-ended instruction following with hierarchical vision-language-action models. arXiv preprint arXiv:2502.19417 (2025)

  29. [29]

    Scaling Laws for Native Multimodal Models

    Shukor, M., Fini, E., da Costa, V.G.T., Cord, M., Susskind, J., El-Nouby, A.: Scaling laws for native multimodal models. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 12–23 (2025)

  30. [30]

    Neural Descriptor Fields: SE(3)-Equivariant Object Representations for Manipulation

    Simeonov, A., Du, Y., Tagliasacchi, A., Tenenbaum, J.B., Rodriguez, A., Agrawal, P., Sitzmann, V.: Neural descriptor fields: Se (3)-equivariant object representations for manipulation. In: 2022 International Conference on Robotics and Automation (ICRA). pp. 6394–6400. IEEE (2022)

  31. [31]

    Accelerating Vision-Language-Action Model Integrated with Action Chunking via Parallel Decoding

    Song, W., Chen, J., Ding, P., Zhao, H., Zhao, W., Zhong, Z., Ge, Z., Ma, J., Li, H.: Accelerating vision-language-action model integrated with action chunking via parallel decoding. arXiv preprint arXiv:2503.02310 (2025)

  32. [32]

    PaliGemma 2: A Family of Versatile VLMs for Transfer

    Steiner, A., Pinto, A.S., Tschannen, M., Keysers, D., Wang, X., Bitton, Y., Gritsenko, A., Minderer, M., Sherbondy, A., Long, S., et al.: Paligemma 2: A family of versatile vlms for transfer. arXiv preprint arXiv:2412.03555 (2024)

  33. [33]

    KITE: Keypoint-Conditioned Policies for Semantic Manipulation

    Sundaresan, P., Belkhale, S., Sadigh, D., Bohg, J.: KITE: keypoint-conditioned policies for semantic manipulation. In: Tan, J., Toussaint, M., Darvish, K. (eds.) Conference on Robot Learning, CoRL 2023, 6-9 November 2023, Atlanta, GA, USA. Proceedings of Machine Learning Research, vol. 229, pp. 1006–1021. PMLR (2023)

  34. [34]

    D3Fields: Dynamic 3D Descriptor Fields for Zero-Shot Generalizable Rearrangement

    Wang, Y., Zhang, M., Li, Z., Kelestemur, T., Driggs-Campbell, K., Wu, J., Fei-Fei, L., Li, Y.: D3fields: Dynamic 3d descriptor fields for zero-shot generalizable rearrangement. In: 8th Annual Conference on Robot Learning (2024)

  35. [35]

    SAPIEN: A Simulated Part-Based Interactive Environment

    Xiang, F., Qin, Y., Mo, K., Xia, Y., Zhu, H., Liu, F., Liu, M., Jiang, H., Yuan, Y., Wang, H., Yi, L., Chang, A.X., Guibas, L.J., Su, H.: SAPIEN: A simulated part-based interactive environment. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (June 2020)

  36. [36]

    Thinking in Space: How Multimodal Large Language Models See, Remember, and Recall Spaces

    Yang, J., Yang, S., Gupta, A.W., Han, R., Fei-Fei, L., Xie, S.: Thinking in space: How multimodal large language models see, remember, and recall spaces. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 10632–10643 (2025)

  37. [37]

    RoboPoint: A Vision-Language Model for Spatial Affordance Prediction for Robotics

    Yuan, W., Duan, J., Blukis, V., Pumacay, W., Krishna, R., Murali, A., Mousavian, A., Fox, D.: Robopoint: A vision-language model for spatial affordance prediction for robotics. arXiv preprint arXiv:2406.10721 (2024)

  38. [38]

    M2T2: Multi-Task Masked Transformer for Object-Centric Pick and Place

    Yuan, W., Murali, A., Mousavian, A., Fox, D.: M2T2: Multi-task masked transformer for object-centric pick and place. In: Tan, J., Toussaint, M., Darvish, K. (eds.) Conference on Robot Learning, CoRL 2023, 6-9 November 2023, Atlanta, GA, USA. Proceedings of Machine Learning Research, vol. 229, pp. 3619–3630. PMLR (2023)

  39. [39]

    From Seeing to Doing: Bridging Reasoning and Decision for Robotic Manipulation

    Yuan, Y., Cui, H., Chen, Y., Dong, Z., Ni, F., Kou, L., Liu, J., Li, P., Zheng, Y., Hao, J.: From seeing to doing: Bridging reasoning and decision for robotic manipulation. arXiv preprint arXiv:2505.08548 (2025)

  40. [40]

    Sim2Real VLA: Zero-Shot Generalization of Synthesized Skills to Realistic Manipulation

    Zhao, R., Xu, S., Jin, R., Deng, Y., Tai, Y., Jia, K., Liu, G.: Sim2real vla: Zero-shot generalization of synthesized skills to realistic manipulation. In: The Fourteenth International Conference on Learning Representations

  41. [41]

    Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware

    Zhao, T.Z., Kumar, V., Levine, S., Finn, C.: Learning fine-grained bimanual manipulation with low-cost hardware. arXiv preprint arXiv:2304.13705 (2023)

  42. [42]

    AnyPlace: Learning Generalized Object Placement for Robot Manipulation

    Zhao, Y., Bogdanovic, M., Luo, C., Tohme, S., Darvish, K., Aspuru-Guzik, A., Shkurti, F., Garg, A.: Anyplace: learning generalized object placement for robot manipulation. arXiv preprint arXiv:2502.04531 (2025)

  43. [43]

    RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control

    Zitkovich, B., Yu, T., Xu, S., Xu, P., Xiao, T., Xia, F., Wu, J., Wohlhart, P., Welker, S., Wahid, A., et al.: Rt-2: Vision-language-action models transfer web knowledge to robotic control. In: Conference on Robot Learning. pp. 2165–2183. PMLR (2023)