AxisGuide: Grounding Robot Action Coordinate System in RGB Observations for Robust Visuomotor Manipulation

Daewon Chae; Jinkyu Kim; Jiyun Jang; Jungbeom Lee; Sangwon Lee; Sohwi Kim; Woosung Joung; Yujin Sung

arxiv: 2606.06761 · v1 · pith:XH7K36IKnew · submitted 2026-06-04 · 💻 cs.RO · cs.AI

Jiyun Jang, Yujin Sung, Woosung Joung, Daewon Chae, Sangwon Lee, Sohwi Kim, Jinkyu Kim, Jungbeom Lee

Pith reviewed 2026-06-28 00:44 UTC · model grok-4.3

keywords visuomotor policies · action coordinate grounding · robot manipulation · RGB augmentation · base-frame axes · behavior cloning · generalization · LIBERO benchmark

The pith

AxisGuide renders robot base-frame axes into RGB images to help policies map actions to image space.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Visuomotor policies trained by behavior cloning understand scenes but often fail at low-level actions when object positions shift even slightly. The paper attributes this to policies being unable to reliably read what base-frame +x, +y, and +z motions look like in a given camera view. AxisGuide addresses this by using camera parameters and end-effector poses to render the base axes in each camera view and append them to the RGB observation as extra cue channels. The added cues let the policy learn the meaning of each action direction in pixel space without auxiliary losses or a redesigned policy architecture. Tests on the LIBERO benchmark and on real robots are reported to show higher success rates and better handling of new object locations and viewpoints.

Core claim

AxisGuide renders the robot base-frame axes in each camera view using known camera parameters and end-effector poses, then augments the RGB input with a small set of cue channels that explicitly visualize the meaning of +x, +y, and +z base-frame motions in image space. This explicit grounding bridges semantic scene understanding and action-coordinate interpretation, allowing standard behavior-cloned policies to execute reliable actions under distribution shifts.
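The rendering step in the core claim can be illustrated with a minimal sketch, assuming a distortion-free pinhole camera and one binary channel per axis. The function names, the 5 cm axis length, and the toy camera values are all hypothetical choices made for this illustration; the paper's actual rendering (line width, colors, normalization) is not specified in the text above.

```python
def base_to_camera(p_base, R, t):
    """Map a base-frame point into the camera frame: p_cam = R @ p_base + t."""
    return tuple(sum(R[i][j] * p_base[j] for j in range(3)) + t[i] for i in range(3))


def project(p_cam, fx, fy, cx, cy):
    """Pinhole projection of a camera-frame point (z > 0) to pixel coordinates (u, v)."""
    x, y, z = p_cam
    return (fx * x / z + cx, fy * y / z + cy)


def draw_segment(channel, p0, p1):
    """Mark the pixels on the segment p0 -> p1 by dense sampling (no anti-aliasing)."""
    h, w = len(channel), len(channel[0])
    steps = int(max(abs(p1[0] - p0[0]), abs(p1[1] - p0[1]))) + 1
    for s in range(steps + 1):
        a = s / steps
        u = round(p0[0] + a * (p1[0] - p0[0]))
        v = round(p0[1] + a * (p1[1] - p0[1]))
        if 0 <= v < h and 0 <= u < w:
            channel[v][u] = 1.0


def axis_cue_channels(ee_pos_base, R, t, intrinsics, size, length=0.05):
    """Return three H x W channels showing base-frame +x, +y, +z at the gripper.

    `length` (meters) is an assumed visualization length, not a value from the paper.
    """
    fx, fy, cx, cy = intrinsics
    h, w = size
    origin_px = project(base_to_camera(ee_pos_base, R, t), fx, fy, cx, cy)
    channels = []
    for axis in ((1, 0, 0), (0, 1, 0), (0, 0, 1)):
        tip_base = tuple(ee_pos_base[i] + length * axis[i] for i in range(3))
        tip_px = project(base_to_camera(tip_base, R, t), fx, fy, cx, cy)
        channel = [[0.0] * w for _ in range(h)]
        draw_segment(channel, origin_px, tip_px)
        channels.append(channel)
    return channels


# Toy setup (hypothetical): camera axes aligned with the base frame,
# base origin 1 m in front of the camera, gripper at the base origin.
R = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
t = [0.0, 0.0, 1.0]
cues = axis_cue_channels((0.0, 0.0, 0.0), R, t, (100.0, 100.0, 32.0, 32.0), (64, 64))
```

In this toy view the +z axis points along the optical axis, so its cue collapses to a single pixel. That is the kind of single-view ambiguity a second camera, such as the wrist and front pair mentioned in the figure captions, could help resolve.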

What carries the argument

AxisGuide rendering of base-frame axes as additional cue channels in RGB observations.

Load-bearing premise

The rendered axes supply clear, non-conflicting visual information that a standard policy network can use to correctly interpret base-frame actions.

What would settle it

Train identical policies with and without the axis cues on LIBERO tasks that place objects at unseen locations, then compare success rates. The claim is supported if the cued policy succeeds more often at unseen locations, and undermined if the two policies perform the same.
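A comparison like this needs some way to decide whether two success rates actually differ. The sketch below uses a standard pooled two-proportion z-test, which is one reasonable choice rather than anything the paper prescribes. The function name and the rollout counts are hypothetical placeholders, not results from the paper.

```python
import math


def two_proportion_z(success_a, n_a, success_b, n_b):
    """Pooled two-proportion z-test; returns (rate difference, z, two-sided p-value)."""
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n_a + 1.0 / n_b))
    if se == 0.0:
        # Both policies always fail or always succeed: no detectable difference.
        return p_b - p_a, 0.0, 1.0
    z = (p_b - p_a) / se
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided tail of the normal
    return p_b - p_a, z, p_value


# Placeholder counts for illustration only (baseline vs. cued policy).
diff, z, p_value = two_proportion_z(10, 50, 30, 50)
```

With only a few dozen rollouts per condition the normal approximation is rough, so an exact test or several training seeds would be a more careful choice.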

Figures

Figures reproduced from arXiv: 2606.06761 by Daewon Chae, Jinkyu Kim, Jiyun Jang, Jungbeom Lee, Sangwon Lee, Sohwi Kim, Woosung Joung, Yujin Sung.

**Figure 1.** AxisGuide: Grounding Robot Action Coordinate System for Robust Manipulation. Conventional visuomotor policies (left) struggle to generalize beyond training data (blue squares), often failing at unseen locations (yellow box). In contrast, AxisGuide (right) enables robust task execution across a wide range of unseen spatial configurations. By explicitly associating the action space with image observations th…

**Figure 2.** An overview of AxisGuide. Using camera intrinsics and extrinsics, AxisGuide projects the robot base-frame x, y, and z axes onto the 2D image plane, centered at the gripper, and renders them as additional channels alongside RGB images from all cameras. This explicit visualization enables the policy to better understand the correspondence between visual observations and robot base-frame actions. [PITH_FULL_IMAGE:figures/full_fig_p004_2.png]

**Figure 3.** Quantitative Results in the Multi-View Simulation Setup (LIBERO). We compare success rates of AxisGuide with baseline methods in single-task (left) and multi-task (right) settings using wrist and front cameras. Unlike the standard SmolVLA training pipeline [23], we train the full model including the image backbone to support additional coordinate cue channels. For fair comparison, we report both the action…
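The Figure 3 caption notes that the full model, including the image backbone, is trained so that it can accept the additional cue channels. The text does not say how the input layer is adapted. One common option, shown here purely as an assumption, is to widen the first layer's input weights and zero-initialize the new columns, so the widened layer initially reproduces the RGB-only output. The function names and the per-pixel linear layer are hypothetical simplifications of a real convolutional or patch-embedding layer.

```python
def expand_input_weights(weights, extra_channels):
    """Append zero-initialized input columns to each output row of a weight matrix."""
    return [list(row) + [0.0] * extra_channels for row in weights]


def linear(weights, pixel):
    """Apply a per-pixel linear layer: one output value per weight row."""
    return [sum(w * x for w, x in zip(row, pixel)) for row in weights]


# Hypothetical 3-channel weights (two output features) and one pixel.
rgb_weights = [[0.5, -1.0, 2.0], [1.0, 0.0, 0.25]]
rgb_pixel = [0.2, 0.4, 0.6]
cue_pixel = [1.0, 0.0, 1.0]  # x and z cue channels active at this pixel

wide_weights = expand_input_weights(rgb_weights, extra_channels=3)
before = linear(rgb_weights, rgb_pixel)
after = linear(wide_weights, rgb_pixel + cue_pixel)
```

Here `before` and `after` are identical at initialization, so the cue channels only start to influence the features once training updates the new columns.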

**Figure 4.** Real-world Manipulation Tasks for Evaluation. We show initial states (top) and goal states (bottom) for Pick & Place (Grape), Flip Pot, and Close Pot, which require different combinations of translational and rotational actions. Close Pot consists of picking up the pot lid and closing the pot. [PITH_FULL_IMAGE:figures/full_fig_p005_4.png]

**Figure 5.** Generalization to Novel Object Positions. (a) shows that the baseline DP (left) generalizes poorly, with success largely confined to regions near training data, whereas DP with AxisGuide (right) reliably reaches unseen object positions between clusters, which is consistent with the simulation results. [PITH_FULL_IMAGE:figures/full_fig_p006_5.png]

**Figure 6.** Rollout Behaviors Under Unseen Object Locations on LIBERO Simulation and Real-World Manipulation. In the Pick Up (Bowl) task in the LIBERO simulation (top), the baseline model (SmolVLA) fails to adapt its actions when the target bowl is placed at an unseen location. In contrast, the same model trained with AxisGuide precisely reaches the target by grounding the action coordinate system in image space. …

**Figure 7.** The UR5e robot setup used in our real-world experiments. Each real-world task is evaluated to test the policy’s ability to handle fine-grained control. All tasks use a 10 Hz control frequency with RGB observations resized to 256 × 256. To verify that AxisGuide transfers to real-world deployment, we design a set of tabletop manipulation tasks that require precise action coordinate grounding and contact…

**Figure 8.** LIBERO simulation benchmark [18]. We evaluate Diffusion Policy [3] and SmolVLA [23] on LIBERO task suites, and additionally augment each policy with AxisGuide cues to study action coordinate grounding. We construct the novel object position generalization benchmark using the LIBERO-Spatial suite.

**Figure 9.** LIBERO-Spatial task setup. Numbered boxes indicate the typical target-object region for each of the ten LIBERO-Spatial tasks (0–9), where demonstrations place the object near the corresponding region. For our novel object position study, we train on the remaining regions (green) while excluding tasks 2 and 4 (red). At evaluation time, we progressively expand the test placement region outward from the task…

**Figure 11.** Viewpoint generalizability task setup. Visualization of the dataset constructed by varying the camera viewpoint from −45° to 45° in 22.5° increments. We use demonstrations from LIBERO-Spatial tasks 0 and 2 to train SmolVLA [23], and evaluate generalization to unseen viewpoints by testing viewpoints at 10° intervals. TABLE VI: Quantitative Comparison of Viewpoint Generalizability in the LIBERO Simulation…

Original abstract

Visuomotor manipulation policies trained via large-scale behavior cloning have achieved strong semantic scene understanding, yet often fail to reliably execute correct low-level actions under distribution shifts. For example, even in a simple pickup task with identical scene layouts, camera viewpoints, and illumination, performance can degrade substantially when the object is placed at unseen locations. We argue that this gap arises from insufficient action understanding, namely the inability to interpret the robot's base-frame action coordinate system in image space. To address this issue, we introduce AxisGuide, a lightweight guidance method that bridges semantic scene understanding and action-coordinate interpretation. Using camera parameters and end-effector poses, AxisGuide renders the robot base-frame axes in each camera view and augments RGB observations with a small set of cue channels that explicitly visualize the meaning of the +x, +y, and +z motions in image space. Extensive evaluations in both the LIBERO simulation and real-world environments demonstrate that AxisGuide yields substantial performance gains and improved generalization, highlighting the effectiveness of explicit action-coordinate cues for learning reliable and transferable generalist visuomotor policies.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

AxisGuide renders base-frame axes into observations as a lightweight cue for action coordinates, but the abstract gives no numbers and the real-world claims rest on untested calibration accuracy.

The main point is that AxisGuide renders the robot's base-frame x, y, z axes into each camera view using intrinsics, extrinsics, and end-effector poses, then feeds those extra channels to the policy so it can see what the action directions actually mean in image space. This is presented as a direct response to policies that handle semantics but still fail on unseen object positions even under matched views and lighting.

The approach is straightforward and additive: no architecture changes, no extra losses, just rendered cues on top of standard behavior cloning. That keeps the barrier low for anyone already running visuomotor training.

The abstract claims substantial gains and better generalization in both LIBERO and real-world tests, yet supplies no success rates, baseline numbers, ablations, or statistical details. Without those, the size of the improvement, and whether it survives different random seeds or slight distribution shifts, remains unclear. The concern about calibration also lands: the method depends on accurate camera parameters and poses to produce cues that do not mislead. Millimeter- or degree-level errors are common in real setups, and if the paper does not quantify how the policy behaves under such noise, the real-world transfer story needs more support.
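To get a feel for the scale of the calibration concern, the sketch below estimates how far a projected cue point moves in the image when the camera extrinsics are slightly wrong. It assumes a distortion-free pinhole camera and models the error as a roll about the optical axis plus a translation offset. All names and numbers (600 px focal length, 1 m depth) are hypothetical; this is a back-of-envelope geometry check, not the sensitivity study the referee asks for.

```python
import math


def rot_z(deg):
    """Rotation matrix for a roll of `deg` degrees about the camera's optical axis."""
    a = math.radians(deg)
    c, s = math.cos(a), math.sin(a)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def transform(R, t, p):
    """Apply the rigid transform R @ p + t."""
    return [sum(R[i][j] * p[j] for j in range(3)) + t[i] for i in range(3)]


def project(p_cam, fx, fy, cx, cy):
    """Pinhole projection of a camera-frame point (z > 0) to pixel coordinates."""
    x, y, z = p_cam
    return (fx * x / z + cx, fy * y / z + cy)


def pixel_shift(p_base, R, t, intrinsics, rot_err_deg=0.0, trans_err_m=(0.0, 0.0, 0.0)):
    """Pixel distance between a point projected with nominal vs. perturbed extrinsics."""
    nominal = transform(R, t, p_base)
    perturbed = transform(rot_z(rot_err_deg), list(trans_err_m), nominal)
    u0, v0 = project(nominal, *intrinsics)
    u1, v1 = project(perturbed, *intrinsics)
    return math.hypot(u1 - u0, v1 - v0)


# Hypothetical camera: 600 px focal length, base origin 1 m in front of the lens.
R = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
t = [0.0, 0.0, 1.0]
intrinsics = (600.0, 600.0, 320.0, 240.0)

shift_1mm = pixel_shift((0.0, 0.0, 0.0), R, t, intrinsics, trans_err_m=(0.001, 0.0, 0.0))
shift_1deg = pixel_shift((0.1, 0.0, 0.0), R, t, intrinsics, rot_err_deg=1.0)
```

In this toy geometry both errors move the cue by roughly a pixel or less. Whether shifts of that size matter to a trained policy is an empirical question that only the experiments described in the rebuttal could answer.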

This is aimed at people working on behavior-cloned visuomotor policies who want a cheap way to improve coordinate grounding without redesigning the network. A reader already familiar with LIBERO-style benchmarks or sim-to-real gaps would get the most out of it.

I would send it to peer review. The problem it names is real and the proposed fix is simple enough to test, but the authors will need to add quantitative results and calibration sensitivity checks before the claims can be evaluated properly.

Referee Report

2 major / 2 minor

Summary. The paper claims that visuomotor policies trained via behavior cloning often fail to interpret the robot's base-frame action coordinate system in image space, leading to poor generalization under distribution shifts. To address this, it introduces AxisGuide, a lightweight method that uses camera parameters and end-effector poses to render the robot base-frame axes (+x, +y, +z) as additional cue channels overlaid on RGB observations. This explicit visualization is said to bridge semantic understanding and action-coordinate interpretation without architecture changes or auxiliary losses. Extensive evaluations in the LIBERO simulation benchmark and real-world environments are reported to show substantial performance gains and improved generalization for generalist visuomotor policies.

Significance. If the results hold, the work could be significant for robot learning by demonstrating that explicit, rendered action-coordinate cues can improve policy robustness and transfer without modifying the underlying network or training objective. The approach is lightweight and additive, and the dual evaluation in simulation (LIBERO) and real-world settings provides a concrete test of the idea. The absence of architecture changes or extra losses is a positive design choice that keeps the method practical for existing behavior-cloning pipelines.

major comments (2)

[§4, real-world experiments] The central claim that AxisGuide yields substantial generalization gains rests on the rendered axes supplying accurate, non-conflicting visual cues. However, the rendering depends on precise camera intrinsics/extrinsics and end-effector poses; the manuscript provides no quantitative analysis of sensitivity to typical real-world calibration errors (millimeter/degree level), which could produce misaligned cues that degrade rather than improve policy performance.
[Table 2 / Figure 5, LIBERO and real-world results] The abstract and results claim 'substantial performance gains' and 'improved generalization,' yet the provided text does not report concrete metrics, baseline comparisons, statistical significance, or ablation studies isolating the contribution of the cue channels versus other factors; this makes it impossible to evaluate whether the gains are load-bearing or reproducible.

minor comments (2)

[§3] Notation for the rendered cue channels (e.g., how the three axis channels are normalized and concatenated to RGB) is described only at a high level; a precise equation or pseudocode would improve reproducibility.
[§3] The paper does not discuss whether the method assumes perfect end-effector pose estimates during both training and deployment; a short clarification on this assumption would help readers assess deployment feasibility.
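
As an illustration of the kind of equation the first minor comment asks for, one possible formalization is given below. It assumes a pinhole camera with intrinsic matrix $K$, base-to-camera extrinsics $(R, \mathbf{t})$, end-effector position $\mathbf{p}_{ee}$ in the base frame, unit axis vectors $\mathbf{e}_a$, and an axis length $\ell$. All symbols are notation introduced here rather than the paper's, and the choice of three cue channels $C_x, C_y, C_z$ is an assumption.

```latex
\begin{aligned}
\mathbf{q}_a &= K\left(R\,(\mathbf{p}_{ee} + \ell\,\mathbf{e}_a) + \mathbf{t}\right), \qquad a \in \{x, y, z\},\\
\mathbf{u}_a &= \left(q_{a,1}/q_{a,3},\; q_{a,2}/q_{a,3}\right),\\
\tilde{I} &= \left[\, I_{\mathrm{RGB}} \,;\, C_x \,;\, C_y \,;\, C_z \,\right] \in \mathbb{R}^{H \times W \times 6}.
\end{aligned}
```

Here $C_a$ is a single-channel image that is 1 on the pixel segment from $\mathbf{u}_0$ (the projection of $\mathbf{p}_{ee}$, obtained with $\ell = 0$) to $\mathbf{u}_a$ and 0 elsewhere, and $[\,\cdot\,;\,\cdot\,]$ denotes concatenation along the channel dimension.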

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and indicate planned revisions.

Referee: [§4, real-world experiments] The central claim that AxisGuide yields substantial generalization gains rests on the rendered axes supplying accurate, non-conflicting visual cues. However, the rendering depends on precise camera intrinsics/extrinsics and end-effector poses; the manuscript provides no quantitative analysis of sensitivity to typical real-world calibration errors (millimeter/degree level), which could produce misaligned cues that degrade rather than improve policy performance.

Authors: We agree that sensitivity to calibration errors is an important practical consideration. The current manuscript does not include a quantitative analysis of this. In the revision we will add experiments that inject millimeter- and degree-level perturbations into camera intrinsics, extrinsics, and end-effector poses, re-render the axis cues, and measure the resulting change in policy success rates. This will directly test whether typical real-world calibration inaccuracies degrade or preserve the reported gains.

Revision: yes

Referee: [Table 2 / Figure 5, LIBERO and real-world results] The abstract and results claim 'substantial performance gains' and 'improved generalization,' yet the provided text does not report concrete metrics, baseline comparisons, statistical significance, or ablation studies isolating the contribution of the cue channels versus other factors; this makes it impossible to evaluate whether the gains are load-bearing or reproducible.

Authors: Table 2 in the manuscript already reports per-task success rates for AxisGuide against the listed baselines on LIBERO, and Figure 5 reports real-world success rates. Ablation results isolating the cue channels appear in the supplementary material. We acknowledge, however, that statistical significance (standard deviations across seeds) and explicit isolation of the cue contribution are not sufficiently prominent in the main text. In the revision we will move key numerical results, baseline comparisons, and significance indicators into the main body and add a dedicated ablation subsection to improve evaluability and reproducibility.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity; AxisGuide is an independent input augmentation

The paper introduces AxisGuide as a rendering step that augments RGB inputs with base-frame axis cues computed from camera parameters and end-effector poses. This preprocessing is external to the policy network and training loop. Claims of performance gains rest on empirical evaluations in LIBERO and real-world settings rather than any mathematical derivation, fitted parameter renamed as prediction, or self-citation chain. No equations or steps reduce by construction to the inputs; the method is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Based solely on the abstract, the approach relies on standard robotics assumptions about known poses rather than introducing new fitted parameters or entities.

axioms (1)

Domain assumption: Camera intrinsic/extrinsic parameters and end-effector poses are known and accurate enough to render base-frame axes correctly in each view. Invoked to enable the rendering step described in the abstract.

Reference graph

Works this paper leans on

32 extracted references · 3 canonical work pages

[1] Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al. π0: A vision-language-action flow model for general robot control. arXiv preprint arXiv:2410.24164, 2024.

[2] Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Joseph Dabis, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alexander Herzog, Jasmine Hsu, Julian Ibarz, Brian Ichter, Alex Irpan, Tomas Jackson, Sally Jesmonth, Nikhil J. Joshi, Ryan Julian, Dmitry Kalashnikov, Yuheng Kuang, Isabel Leal, Kuang-Huei Lee, Sergey Levine, Yao Lu, Ut... doi: 10.15607/rss.2023.xix.025, 2023.

[3] Cheng Chi, Zhenjia Xu, Siyuan Feng, Eric Cousineau, Yilun Du, Benjamin Burchfiel, Russ Tedrake, and Shuran Song. Diffusion policy: Visuomotor policy learning via action diffusion. The International Journal of Robotics Research, 44(10-11):1684–1704, 2025.

[4] Yinpei Dai, Jayjun Lee, Yichi Zhang, Ziqiao Ma, Jianing Yang, Amir Zadeh, Chuan Li, Nima Fazeli, and Joyce Chai. Aimbot: A simple auxiliary visual cue to enhance spatial awareness of visuomotor policies. In Conference on Robot Learning, pages 2409–2429. PMLR, 2025.

[5] Irving Fang, Juexiao Zhang, Shengbang Tong, and Chen Feng. From intention to execution: Probing the generalization boundaries of vision-language-action models. arXiv preprint arXiv:2506.09930, 2025. URL https://arxiv.org/abs/2506.09930

[6] Senyu Fei, Siyin Wang, Junhao Shi, Zihao Dai, Jikun Cai, Pengfang Qian, Li Ji, Xinzhe He, Shiduo Zhang, Zhaoye Fei, Jinlan Fu, Jingjing Gong, and Xipeng Qiu. Libero-plus: In-depth robustness analysis of vision-language-action models, 2025. URL https://arxiv.org/abs/2510.13626

[7] David A Forsyth and Jean Ponce. Computer vision: a modern approach. Prentice Hall Professional Technical Reference, 2002.

[8] Jiayuan Gu, Sean Kirmani, Paul Wohlhart, Yao Lu, Montserrat Gonzalez Arenas, Kanishka Rao, Wenhao Yu, Chuyuan Fu, Keerthana Gopalakrishnan, Zhuo Xu, et al. Rt-trajectory: Robotic task generalization via hindsight trajectory sketches. arXiv preprint arXiv:2311.01977, 2023.

[9] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.

[10] Jiahui Huang, Qunjie Zhou, Hesam Rabeti, Aleksandr Korovko, Huan Ling, Xuanchi Ren, Tianchang Shen, Jun Gao, Dmitry Slepichev, Chen-Hsuan Lin, et al. Vipe: Video pose engine for 3d geometric perception. arXiv preprint arXiv:2508.10934, 2025.

[11] Physical Intelligence, Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, et al. π0.5: a vision-language-action model with open-world generalization. arXiv preprint arXiv:2504.16054, 2025.

[12] Tianchong Jiang, Jingtian Ji, Xiangshan Tan, Jiading Fang, Anand Bhattad, Vitor Guizilini, and Matthew R. Walter. Do you know where your camera is? View-invariant policy learning with camera conditioning. arXiv preprint arXiv:2510.02268, 2025.

[13] Alexander Khazatsky, Karl Pertsch, Suraj Nair, Ashwin Balakrishna, Sudeep Dasari, Siddharth Karamcheti, Soroush Nasiriany, Mohan Kumar Srirama, Lawrence Yunliang Chen, Kirsty Ellis, Peter David Fagan, Joey Hejna, Masha Itkina, Marion Lepert, Yecheng Jason Ma, Patrick Tree Miller, Jimmy Wu, Suneel Belkhale, Shivin Dass, Huy Ha, Arhan Jain, Abraham Le... doi: 10.15607/RSS.2024.XX.120, 2024.

[14] Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan P Foster, Pannag R Sanketi, Quan Vuong, Thomas Kollar, Benjamin Burchfiel, Russ Tedrake, Dorsa Sadigh, Sergey Levine, Percy Liang, and Chelsea Finn. Openvla: An open-source vision-language-action model. In Pulkit Agrawal, Oliver Kroemer, and W... 2025.

[15] Jason Lee, Jiafei Duan, Haoquan Fang, Yuquan Deng, Shuo Liu, Boyang Li, Bohan Fang, Jieyu Zhang, Yi Ru Wang, Sangho Lee, et al. Molmoact: Action reasoning models that can reason in space. arXiv preprint arXiv:2508.07917, 2025.

[16] Seungjae Lee, Yibin Wang, Haritheja Etukuru, H Jin Kim, Nur Muhammad Mahi Shafiullah, and Lerrel Pinto. Behavior generation with latent actions. arXiv preprint arXiv:2403.03181, 2024.

[17] Yi Li, Yuquan Deng, Jesse Zhang, Joel Jang, Marius Memmel, Caelan Garrett, Fabio Ramos, Dieter Fox, Anqi Li, Abhishek Gupta, and Ankit Goyal. Hamster: Hierarchical action models for open-world robot manipulation. In Y. Yue, A. Garg, N. Peng, F. Sha, and R. Yu, editors, International Conference on Representation Learning, volume 2025, pages 24040–24068, 2...

[18] Bo Liu, Yifeng Zhu, Chongkai Gao, Yihao Feng, Qiang Liu, Yuke Zhu, and Peter Stone. Libero: Benchmarking knowledge transfer for lifelong robot learning. Advances in Neural Information Processing Systems, 36:44776–44791, 2023.

[19] Jingpei Lu, Zekai Liang, Tristin Xie, Florian Richter, Shan Lin, Sainan Liu, and Michael C Yip. Ctrnet-x: Camera-to-robot pose estimation in real-world conditions using a single camera. In 2025 IEEE International Conference on Robotics and Automation (ICRA), pages 1914–1920. IEEE, 2025.

[20] Jianlan Luo, Charles Xu, Fangchen Liu, Liam Tan, Zipeng Lin, Jeffrey Wu, Pieter Abbeel, and Sergey Levine. Fmb: a functional manipulation benchmark for generalizable robotic learning. The International Journal of Robotics Research, 44(4):592–606, 2025.

[21] Abby O’Neill, Abdul Rehman, Abhiram Maddukuri, Abhishek Gupta, Abhishek Padalkar, Abraham Lee, Acorn Pooley, Agrim Gupta, Ajay Mandlekar, Ajinkya Jain, et al. Open x-embodiment: Robotic learning datasets and rt-x models: Open x-embodiment collaboration 0. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pages 6892–6903. IEEE, 2024.

[22] Nur Muhammad Shafiullah, Zichen Cui, Ariuntuya Arty Altanzaya, and Lerrel Pinto. Behavior transformers: Cloning k modes with one stone. Advances in neural information processing systems, 35:22955–22968, 2022.

[23] Mustafa Shukor, Dana Aubakirova, Francesco Capuano, Pepijn Kooijmans, Steven Palma, Adil Zouitine, Michel Aractingi, Caroline Pascal, Martino Russi, Andres Marafioti, et al. Smolvla: A vision-language-action model for affordable and efficient robotics. arXiv preprint arXiv:2506.01844, 2025.

[24] Faraz Torabi, Garrett Warnell, and Peter Stone. Behavioral cloning from observation. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI-18, pages 4950–4957. International Joint Conferences on Artificial Intelligence Organization, 7 2018. doi: 10.24963/ijcai.2018/687

[25] Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muhammad Ferjad Naeem, Ibrahim Alabdulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, et al. Siglip 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features. arXiv preprint arXiv:2502.14786, 2025.

[26] Homer Rich Walke, Kevin Black, Tony Z Zhao, Quan Vuong, Chongyi Zheng, Philippe Hansen-Estruch, Andre Wang He, Vivek Myers, Moo Jin Kim, Max Du, et al. Bridgedata v2: A dataset for robot learning at scale. In Conference on Robot Learning, pages 1723–

[27] Wentao Yuan, Jiafei Duan, Valts Blukis, Wilbert Pumacay, Ranjay Krishna, Adithyavairavan Murali, Arsalan Mousavian, and Dieter Fox. Robopoint: A vision-language model for spatial affordance prediction for robotics. arXiv preprint arXiv:2406.10721, 2024.

[28] Hao Zhang, Feng Li, Shilong Liu, Lei Zhang, Hang Su, Jun Zhu, Lionel M Ni, and Heung-Yeung Shum. Dino: Detr with improved denoising anchor boxes for end-to-end object detection. arXiv preprint arXiv:2203.03605, 2022.

[29] Jinliang Zheng, Jianxiong Li, Zhihao Wang, Dongxiu Liu, Xirui Kang, Yuchun Feng, Yinan Zheng, Jiayin Zou, Yilun Chen, Jia Zeng, et al. X-vla: Soft-prompted transformer as scalable cross-embodiment vision-language-action model. arXiv preprint arXiv:2510.10274, 2025.

[30] Ruijie Zheng, Yongyuan Liang, Shuaiyi Huang, Jianfeng Gao, Hal Daumé III, Andrey Kolobov, Furong Huang, and Jianwei Yang. Tracevla: Visual trace prompting enhances spatial-temporal awareness for generalist robotic policies. In The Thirteenth International Conference on Learning Representations

[31] Brianna Zitkovich, Tianhe Yu, Sichun Xu, Peng Xu, Ted Xiao, Fei Xia, Jialin Wu, Paul Wohlhart, Stefan Welker, Ayzaan Wahid, et al. Rt-2: Vision-language-action models transfer web knowledge to robotic control. In Conference on Robot Learning, pages 2165–2183. PMLR, 2023.

[1] [1]

arXiv preprint arXiv:2410.24164, 2024

Kevin Black, Noah Brown, Danny Driess, Adnan Es- mail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al.π 0: A vision- language-action flow model for general robot control. arXiv preprint arXiv:2410.24164, 2024

Pith/arXiv arXiv 2024

[2] [2]

Anthony Brohan, Noah Brown, Justice Carbajal, Yev- gen Chebotar, Joseph Dabis, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alexander Herzog, Jas- mine Hsu, Julian Ibarz, Brian Ichter, Alex Irpan, Tomas Jackson, Sally Jesmonth, Nikhil J. Joshi, Ryan Julian, Dmitry Kalashnikov, Yuheng Kuang, Isabel Leal, Kuang- Huei Lee, Sergey Levine, Yao Lu, Ut...

work page doi:10.15607/rss.2023.xix.025 2023

[3] [3]

Diffusion policy: Visuomotor policy learning via action diffusion.The International Journal of Robotics Research, 44(10-11):1684–1704, 2025

Cheng Chi, Zhenjia Xu, Siyuan Feng, Eric Cousineau, Yilun Du, Benjamin Burchfiel, Russ Tedrake, and Shuran Song. Diffusion policy: Visuomotor policy learning via action diffusion.The International Journal of Robotics Research, 44(10-11):1684–1704, 2025

2025

[4] [4]

Aimbot: A simple auxiliary visual cue to enhance spatial awareness of visuomotor policies

Yinpei Dai, Jayjun Lee, Yichi Zhang, Ziqiao Ma, Jianing Yang, Amir Zadeh, Chuan Li, Nima Fazeli, and Joyce Chai. Aimbot: A simple auxiliary visual cue to enhance spatial awareness of visuomotor policies. InConference on Robot Learning, pages 2409–2429. PMLR, 2025

2025

[5] [5]

From intention to execution: Probing the gener- alization boundaries of vision-language-action models

Irving Fang, Juexiao Zhang, Shengbang Tong, and Chen Feng. From intention to execution: Probing the gener- alization boundaries of vision-language-action models. arXiv preprint arXiv:2506.09930, 2025. URL https: //arxiv.org/abs/2506.09930

arXiv 2025

[6] [6]

Libero- plus: In-depth robustness analysis of vision-language- action models, 2025

Senyu Fei, Siyin Wang, Junhao Shi, Zihao Dai, Jikun Cai, Pengfang Qian, Li Ji, Xinzhe He, Shiduo Zhang, Zhaoye Fei, Jinlan Fu, Jingjing Gong, and Xipeng Qiu. Libero- plus: In-depth robustness analysis of vision-language- action models, 2025. URL https://arxiv.org/abs/2510. 13626

2025

[7] [7]

prentice hall professional technical reference, 2002

David A Forsyth and Jean Ponce.Computer vision: a modern approach. prentice hall professional technical reference, 2002

2002

[8] Jiayuan Gu, Sean Kirmani, Paul Wohlhart, Yao Lu, Montserrat Gonzalez Arenas, Kanishka Rao, Wenhao Yu, Chuyuan Fu, Keerthana Gopalakrishnan, Zhuo Xu, et al. Rt-trajectory: Robotic task generalization via hindsight trajectory sketches. arXiv preprint arXiv:2311.01977, 2023.

[9] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.

[10] Jiahui Huang, Qunjie Zhou, Hesam Rabeti, Aleksandr Korovko, Huan Ling, Xuanchi Ren, Tianchang Shen, Jun Gao, Dmitry Slepichev, Chen-Hsuan Lin, et al. Vipe: Video pose engine for 3d geometric perception. arXiv preprint arXiv:2508.10934, 2025.

[11] Physical Intelligence, Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, et al. π0.5: a vision-language-action model with open-world generalization. arXiv preprint arXiv:2504.16054, 2025.

[12] Tianchong Jiang, Jingtian Ji, Xiangshan Tan, Jiading Fang, Anand Bhattad, Vitor Guizilini, and Matthew R. Walter. Do you know where your camera is? View-invariant policy learning with camera conditioning. arXiv preprint arXiv:2510.02268, 2025.

[13] Alexander Khazatsky, Karl Pertsch, Suraj Nair, Ashwin Balakrishna, Sudeep Dasari, Siddharth Karamcheti, Soroush Nasiriany, Mohan Kumar Srirama, Lawrence Yunliang Chen, Kirsty Ellis, Peter David Fagan, Joey Hejna, Masha Itkina, Marion Lepert, Yecheng Jason Ma, Patrick Tree Miller, Jimmy Wu, Suneel Belkhale, Shivin Dass, Huy Ha, Arhan Jain, Abraham Le...

[14] doi: 10.15607/RSS.2024.XX.120

[15] Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan P Foster, Pannag R Sanketi, Quan Vuong, Thomas Kollar, Benjamin Burchfiel, Russ Tedrake, Dorsa Sadigh, Sergey Levine, Percy Liang, and Chelsea Finn. Openvla: An open-source vision-language-action model. In Pulkit Agrawal, Oliver Kroemer, and W...

[16] Jason Lee, Jiafei Duan, Haoquan Fang, Yuquan Deng, Shuo Liu, Boyang Li, Bohan Fang, Jieyu Zhang, Yi Ru Wang, Sangho Lee, et al. Molmoact: Action reasoning models that can reason in space. arXiv preprint arXiv:2508.07917, 2025.

[17] Seungjae Lee, Yibin Wang, Haritheja Etukuru, H Jin Kim, Nur Muhammad Mahi Shafiullah, and Lerrel Pinto. Behavior generation with latent actions. arXiv preprint arXiv:2403.03181, 2024.

[18] Yi Li, Yuquan Deng, Jesse Zhang, Joel Jang, Marius Memmel, Caelan Garrett, Fabio Ramos, Dieter Fox, Anqi Li, Abhishek Gupta, and Ankit Goyal. Hamster: Hierarchical action models for open-world robot manipulation. In Y. Yue, A. Garg, N. Peng, F. Sha, and R. Yu, editors, International Conference on Representation Learning, volume 2025, pages 24040–24068, 2...

[19] Bo Liu, Yifeng Zhu, Chongkai Gao, Yihao Feng, Qiang Liu, Yuke Zhu, and Peter Stone. Libero: Benchmarking knowledge transfer for lifelong robot learning. Advances in Neural Information Processing Systems, 36:44776–44791, 2023.

[20] Jingpei Lu, Zekai Liang, Tristin Xie, Florian Richter, Shan Lin, Sainan Liu, and Michael C Yip. Ctrnet-x: Camera-to-robot pose estimation in real-world conditions using a single camera. In 2025 IEEE International Conference on Robotics and Automation (ICRA), pages 1914–1920. IEEE, 2025.

[21] Jianlan Luo, Charles Xu, Fangchen Liu, Liam Tan, Zipeng Lin, Jeffrey Wu, Pieter Abbeel, and Sergey Levine. Fmb: a functional manipulation benchmark for generalizable robotic learning. The International Journal of Robotics Research, 44(4):592–606, 2025.

[22] Abby O’Neill, Abdul Rehman, Abhiram Maddukuri, Abhishek Gupta, Abhishek Padalkar, Abraham Lee, Acorn Pooley, Agrim Gupta, Ajay Mandlekar, Ajinkya Jain, et al. Open x-embodiment: Robotic learning datasets and rt-x models: Open x-embodiment collaboration 0. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pages 6892–6903. IEEE, 2024.

[23] Nur Muhammad Shafiullah, Zichen Cui, Ariuntuya Arty Altanzaya, and Lerrel Pinto. Behavior transformers: Cloning k modes with one stone. Advances in neural information processing systems, 35:22955–22968, 2022.

[24] Mustafa Shukor, Dana Aubakirova, Francesco Capuano, Pepijn Kooijmans, Steven Palma, Adil Zouitine, Michel Aractingi, Caroline Pascal, Martino Russi, Andres Marafioti, et al. Smolvla: A vision-language-action model for affordable and efficient robotics. arXiv preprint arXiv:2506.01844, 2025.

[25] Faraz Torabi, Garrett Warnell, and Peter Stone. Behavioral cloning from observation. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI-18, pages 4950–4957. International Joint Conferences on Artificial Intelligence Organization, 7 2018. doi: 10.24963/ijcai.2018/687

[26] Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muhammad Ferjad Naeem, Ibrahim Alabdulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, et al. Siglip 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features. arXiv preprint arXiv:2502.14786, 2025.

[27] Homer Rich Walke, Kevin Black, Tony Z Zhao, Quan Vuong, Chongyi Zheng, Philippe Hansen-Estruch, Andre Wang He, Vivek Myers, Moo Jin Kim, Max Du, et al. Bridgedata v2: A dataset for robot learning at scale. In Conference on Robot Learning, pages 1723–

[28] Wentao Yuan, Jiafei Duan, Valts Blukis, Wilbert Pumacay, Ranjay Krishna, Adithyavairavan Murali, Arsalan Mousavian, and Dieter Fox. Robopoint: A vision-language model for spatial affordance prediction for robotics. arXiv preprint arXiv:2406.10721, 2024.

[29] Hao Zhang, Feng Li, Shilong Liu, Lei Zhang, Hang Su, Jun Zhu, Lionel M Ni, and Heung-Yeung Shum. Dino: Detr with improved denoising anchor boxes for end-to-end object detection. arXiv preprint arXiv:2203.03605, 2022.

[30] Jinliang Zheng, Jianxiong Li, Zhihao Wang, Dongxiu Liu, Xirui Kang, Yuchun Feng, Yinan Zheng, Jiayin Zou, Yilun Chen, Jia Zeng, et al. X-vla: Soft-prompted transformer as scalable cross-embodiment vision-language-action model. arXiv preprint arXiv:2510.10274, 2025.

[31] Ruijie Zheng, Yongyuan Liang, Shuaiyi Huang, Jianfeng Gao, Hal Daumé III, Andrey Kolobov, Furong Huang, and Jianwei Yang. Tracevla: Visual trace prompting enhances spatial-temporal awareness for generalist robotic policies. In The Thirteenth International Conference on Learning Representations.

[32] Brianna Zitkovich, Tianhe Yu, Sichun Xu, Peng Xu, Ted Xiao, Fei Xia, Jialin Wu, Paul Wohlhart, Stefan Welker, Ayzaan Wahid, et al. Rt-2: Vision-language-action models transfer web knowledge to robotic control. In Conference on Robot Learning, pages 2165–2183. PMLR, 2023.

VII. SUPPLEMENTARY MATERIAL

A. Real World Setup Details

Fig. 7 shows the UR5e robot se...