pith. machine review for the scientific record.

arxiv: 2605.11809 · v1 · submitted 2026-05-12 · 💻 cs.AI

Recognition: 2 theorem links · Lean Theorem

Beyond World-Frame Action Heads: Motion-Centric Action Frames for Vision-Language-Action Models

Huoren Yang, Hu Yusong, Jianchao Zhao, Qiguan Ou, SongLin Dong, Wei Ke, Yihong Gong, Yuhang He, Yuyang Gao, Zhiheng Ma

Pith reviewed 2026-05-13 06:02 UTC · model grok-4.3

classification 💻 cs.AI
keywords Vision-Language-Action models · Motion-Centric Action Frames · action parameterization · robotic manipulation · emergent structure · robustness to perturbations · prototype-based actions · local coordinate frames

The pith

VLA models gain robustness by learning local motion frames instead of predicting world-frame actions directly.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that vision-language-action models can organize robotic manipulation more effectively by replacing direct world-frame action prediction with a lightweight head that predicts a local rotation and composes actions from prototypes inside that frame. This design uses only ordinary demonstration data with no extra labels or supervision. The resulting local frames develop axes that align with end-effector motion directions, while the action representations become more compact and regularly structured. These changes produce better performance under geometric perturbations. A reader would care because the change is simple yet appears to improve generalization and reliability in manipulation tasks.

Core claim

By predicting a rotation R_t in SO(3) at each step, composing actions from a set of prototypes inside the transformed local frame, and mapping the result back to world coordinates for end-to-end training, the policy induces stable emergent geometric structure whose axes match demonstrated end-effector motion; actions become substantially more compact, with variation captured by fewer dominant directions and more regularly organized by shared prototypes; and robustness improves under geometric perturbations.
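Read concretely, the parameterization can be summarized as follows. The linear mixture over prototypes and the squared-error loss are assumptions made for illustration; the text above does not pin down the exact mixing rule or loss.

```latex
% Hedged sketch: a linear mixture over K learned prototypes p_k with
% policy-predicted weights w_{t,k}; the paper's exact mixing rule may differ.
a_t^{\mathrm{local}} = \sum_{k=1}^{K} w_{t,k}\, p_k,
\qquad
a_t^{\mathrm{world}} = R_t \, a_t^{\mathrm{local}},
\qquad R_t \in SO(3),
\qquad
\mathcal{L}_t = \bigl\lVert a_t^{\mathrm{world}} - a_t^{\star} \bigr\rVert_2^2
```

Here $a_t^{\star}$ is the demonstrated world-frame action, so the rotation and the prototypes receive gradient only through this world-frame regression loss.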

What carries the argument

The Motion-Centric Action Frame (MCF) together with prototype-based parameterization: the policy outputs a rotation to define the local frame, selects and combines prototypes inside it, and transforms the resulting action back to the world frame.
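A minimal PyTorch-style sketch of such a head is given below. The 6D rotation parameterization, the softmax mixture over prototypes, the prototype count, and the separate gripper output are all illustrative assumptions; the material above does not specify the head's internals.

```python
# Hedged sketch of a motion-centric action head: the 6D rotation
# parameterization, softmax prototype mixing, K=16, and the gripper output
# are illustrative assumptions, not the paper's confirmed design.
import torch
import torch.nn as nn
import torch.nn.functional as F


def rotation_from_6d(x6: torch.Tensor) -> torch.Tensor:
    """Map a 6D vector to a rotation matrix via Gram-Schmidt orthonormalization."""
    a, b = x6[..., :3], x6[..., 3:]
    e1 = F.normalize(a, dim=-1)
    e2 = F.normalize(b - (e1 * b).sum(-1, keepdim=True) * e1, dim=-1)
    e3 = torch.cross(e1, e2, dim=-1)
    return torch.stack([e1, e2, e3], dim=-1)                 # (..., 3, 3), columns are axes


class MCFProtoHead(nn.Module):
    def __init__(self, feat_dim: int, num_prototypes: int = 16):
        super().__init__()
        self.rot_fc = nn.Linear(feat_dim, 6)                  # predicts the local frame R_t
        self.weight_fc = nn.Linear(feat_dim, num_prototypes)  # prototype mixing weights
        # learned translation prototypes expressed in the local frame
        self.prototypes = nn.Parameter(torch.randn(num_prototypes, 3) * 0.01)
        self.gripper_fc = nn.Linear(feat_dim, 1)              # gripper open/close logit

    def forward(self, feat: torch.Tensor):
        R_t = rotation_from_6d(self.rot_fc(feat))                 # (B, 3, 3)
        w = F.softmax(self.weight_fc(feat), dim=-1)               # (B, K)
        a_local = w @ self.prototypes                             # compose in local frame, (B, 3)
        a_world = torch.einsum("bij,bj->bi", R_t, a_local)        # map back to world frame
        return a_world, torch.sigmoid(self.gripper_fc(feat)), R_t
```

Training would then regress `a_world` (and the gripper command) against the demonstrated world-frame action with the usual behavior-cloning loss, so nothing supervises `R_t` or the prototypes directly; any motion alignment of the frame axes would have to emerge from that indirect pressure.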

If this is right

  • Local frames develop a stable geometric structure whose axes are strongly compatible with demonstrated end-effector motion without any explicit directional supervision.
  • Actions in the learned representation become substantially more compact, with variation captured by fewer dominant directions and more regularly organized by shared prototypes.
  • These structural properties produce improved robustness, especially under geometric perturbations.
  • Adding lightweight geometric and compositional structure to the action head improves how VLA policies organize and generalize robotic manipulation behavior.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The emergent alignment suggests that the action head can discover motion-relevant directions from trajectory statistics alone, potentially reducing reliance on hand-engineered coordinate frames.
  • Compact prototype-based representations could make it easier to transfer policies across robots or environments that share similar motion patterns but differ in absolute scale or orientation.
  • The approach might generalize beyond manipulation to other continuous control domains where local frame prediction could induce useful structure without task-specific labels.

Load-bearing premise

That predicting only a rotation in SO(3) and composing actions from prototypes inside the resulting local frame will cause the learned axes to align with end-effector motion and deliver robustness gains when trained on standard demonstration trajectories alone.

What would settle it

Training an identical VLA backbone with the new head versus a standard world-frame head on the same datasets and measuring whether axis alignment with end-effector motion appears and whether success rates drop less under controlled rotations or translations of the scene.
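A minimal sketch of the two measurements that comparison would require is shown below, assuming rollout logs of end-effector positions and predicted frames; the array shapes, the |cosine| alignment statistic, and the perturbation grid are illustrative choices rather than the paper's protocol, and `success_fn` is a hypothetical rollout wrapper.

```python
# Hedged evaluation sketch; metric and perturbation grid are illustrative.
import numpy as np


def axis_alignment(positions: np.ndarray, frames: np.ndarray) -> float:
    """Mean |cosine| between step-wise end-effector motion and the closest frame axis.

    positions: (T, 3) end-effector positions along one trajectory.
    frames:    (T, 3, 3) predicted local frames, columns interpreted as axes.
    """
    v = np.diff(positions, axis=0)                           # (T-1, 3) motion directions
    v /= np.linalg.norm(v, axis=-1, keepdims=True) + 1e-8
    cos = np.einsum("td,tda->ta", v, frames[:-1])            # cosine with each of the 3 axes
    return float(np.abs(cos).max(axis=-1).mean())


def robustness_curve(success_fn, angles_deg=(0, 10, 20, 30), trials=50) -> dict:
    """Success rate under a controlled scene rotation of a given magnitude."""
    return {a: float(np.mean([success_fn(rotation_deg=a) for _ in range(trials)]))
            for a in angles_deg}
```

Running both heads through the same harness would settle the question: a higher axis_alignment and a flatter robustness_curve for the MCF head would support the claim, while comparable numbers would not.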

Figures

Figures reproduced from arXiv: 2605.11809 by Huoren Yang, Hu Yusong, Jianchao Zhao, Qiguan Ou, SongLin Dong, Wei Ke, Yihong Gong, Yuhang He, Yuyang Gao, Zhiheng Ma.

Figure 1. Motivation of MCF-Proto. Similar manipulation behaviors can appear very different …
Figure 2. Overview of MCF-Proto. Given observations and task instruction, the policy backbone …
Figure 3. Learned MCF visualization on LIBERO. The predicted MCF provides a smooth local basis …
Figure 4. Visualization of action distributions in the world frame and the learned local frame …
Figure 5. The predicted local frame maintains a stable local basis whose axes remain geometrically …
Figure 6. Prototype usage distribution of the learned action model across LIBERO-long tasks (K = …)
Figure 7. Dominant MCF axis over time for the task "pick up the book and place it in the black …"
Figure 8. Additional visualization results of MCF-Proto.
Original abstract

Vision-Language-Action (VLA) models have advanced rapidly with stronger backbones, broader pre-training, and larger demonstration datasets, yet their action heads remain largely homogeneous: most directly predict action commands in a fixed world coordinate frame. We propose \textbf{MCF-Proto}, a lightweight action head that equips VLA policies with a Motion-Centric Action Frame (MCF) and a prototype-based action parameterization. At each step, the policy predicts a rotation $R_t \in SO(3)$, composes actions in the transformed local frame from a set of prototypes, and maps them back to the world frame for end-to-end training, using only standard demonstrations without auxiliary supervision. This simple design induces stable emergent structure. Without explicit directional labels, the learned local frames develop a stable geometric structure whose axes are strongly compatible with demonstrated end-effector motion. Meanwhile, actions in the learned representation become substantially more compact, with variation captured by fewer dominant directions and more regularly organized by shared prototypes. These structural properties translate into improved robustness, especially under geometric perturbations. Our results suggest that adding lightweight geometric and compositional structure to the action head can materially improve how VLA policies organize and generalize robotic manipulation behavior. An anonymized code repository is provided in the supplementary material.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes MCF-Proto, a lightweight action head for Vision-Language-Action (VLA) models. At each timestep the policy predicts a rotation R_t in SO(3), composes actions from a fixed set of prototypes inside the resulting local Motion-Centric Action Frame, and transforms the resulting action back to the world frame. Training uses only the standard world-frame action regression loss on ordinary demonstration trajectories, with no auxiliary supervision, directional labels, or stability regularizers. The central claim is that this architecture induces stable emergent geometric structure whose axes align with demonstrated end-effector motion, yields more compact action representations, and improves robustness under geometric perturbations.

Significance. If the empirical claims are substantiated, the work would demonstrate that a minimal geometric and compositional change to the action head can produce measurable improvements in organization and robustness of VLA policies without extra data or losses. The emergence of motion-aligned frames from standard end-to-end training would be a notable result for robotic manipulation.

major comments (2)
  1. [Section 3] Section 3 (MCF-Proto definition): the parameterization allows any R_t that permits prototype reconstruction of observed actions; no auxiliary term, orthogonality constraint, or stability regularizer is introduced, so the optimization has no explicit pressure to select motion-aligned frames. The manuscript must therefore supply either a formal argument or controlled ablations isolating the source of the claimed alignment.
  2. [Section 5] Section 5 (experimental results): the abstract and results claim improved robustness under geometric perturbations and more compact action representations, yet the provided text supplies no quantitative metrics, error bars, baseline comparisons, or precise description of how perturbations were generated. These details are load-bearing for the central claim and must be added.
minor comments (2)
  1. [Section 3] Notation: the number of prototypes, their initialization, and whether they are learned or fixed should be stated explicitly in the method section.
  2. [Abstract] The supplementary code repository is mentioned but no access instructions or anonymized link are provided.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and constructive comments. We address each major point below and will revise the manuscript to strengthen the presentation of the claimed emergence and empirical support.

Point-by-point responses
  1. Referee: [Section 3] Section 3 (MCF-Proto definition): the parameterization allows any R_t that permits prototype reconstruction of observed actions; no auxiliary term, orthogonality constraint, or stability regularizer is introduced, so the optimization has no explicit pressure to select motion-aligned frames. The manuscript must therefore supply either a formal argument or controlled ablations isolating the source of the claimed alignment.

    Authors: We agree that the current manuscript provides neither a formal proof that the learned R_t must align with motion nor controlled ablations that isolate the contribution of the prototype parameterization. In the revision we will add a dedicated ablation subsection containing: (i) a direct comparison of learned frame axes against the principal components of end-effector velocity in the demonstration data, (ii) training runs that disable the prototype layer while retaining the predicted rotation, and (iii) plots tracking the alignment metric over the course of training. These experiments will clarify whether the observed geometric structure arises specifically from the interaction between the rotation head and the prototype reconstruction objective. revision: yes

  2. Referee: [Section 5] Section 5 (experimental results): the abstract and results claim improved robustness under geometric perturbations and more compact action representations, yet the provided text supplies no quantitative metrics, error bars, baseline comparisons, or precise description of how perturbations were generated. These details are load-bearing for the central claim and must be added.

    Authors: The referee correctly notes that the main text as submitted lacks the quantitative numbers, error bars, and perturbation protocol. The supplementary material already contains the full tables (success rates with mean and standard deviation over five random seeds for MCF-Proto versus world-frame and other baselines) together with PCA-based compactness metrics. We will move the key tables and figures into the main body, add an explicit paragraph describing the perturbation generation procedure (random rotations drawn uniformly from SO(3) with maximum angle 30 degrees applied to the observation frame), and ensure all plots display error bars. revision: yes
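For concreteness, a minimal sketch of the two ingredients promised in these responses, bounded-angle random rotations and the velocity-PCA alignment check, is given below. The axis-angle sampling scheme and the specific alignment statistic are assumptions; the rebuttal does not spell either out.

```python
# Hedged sketch; the sampling law and the alignment statistic are assumptions.
import numpy as np
from scipy.spatial.transform import Rotation


def sample_bounded_rotation(max_angle_deg: float = 30.0) -> np.ndarray:
    """Random rotation with a uniformly random axis and angle up to max_angle_deg."""
    axis = np.random.normal(size=3)
    axis /= np.linalg.norm(axis)
    angle = np.radians(np.random.uniform(0.0, max_angle_deg))
    return Rotation.from_rotvec(angle * axis).as_matrix()       # (3, 3)


def velocity_pca_alignment(velocities: np.ndarray, frames: np.ndarray) -> float:
    """|cosine| between the dominant end-effector velocity direction and the
    closest predicted frame axis, averaged over timesteps (illustrative metric).

    velocities: (T, 3) end-effector velocities; frames: (T, 3, 3) with axes as columns.
    """
    v = velocities - velocities.mean(axis=0)
    _, _, vt = np.linalg.svd(v, full_matrices=False)
    top_pc = vt[0]                                              # first principal direction
    cos = np.abs(np.einsum("d,tda->ta", top_pc, frames))        # (T, 3) cosines per axis
    return float(cos.max(axis=-1).mean())
```

A perturbation protocol along these lines would apply `sample_bounded_rotation()` to the observation frame at the start of each evaluation episode.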

Circularity Check

0 steps flagged

No circularity: architectural proposal with empirical emergence claims

Full rationale

The paper introduces MCF-Proto as a lightweight action head that predicts R_t in SO(3) and composes prototype actions in the local frame before mapping back to world coordinates. All training uses only the standard end-to-end action regression loss on ordinary demonstration trajectories, with no auxiliary terms, constraints, or regularizers on frame alignment. The claimed stable geometric structure and axis compatibility with end-effector motion are presented strictly as observed outcomes of this training process rather than as quantities derived by construction, fitted parameters renamed as predictions, or results justified solely by self-citation. No equations or uniqueness theorems in the manuscript reduce the reported compactness, organization, or robustness gains to the input design itself; the central claims therefore remain independent of the inputs and are subject to external experimental validation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 2 invented entities

The proposal rests on the introduction of a new local frame and prototype set whose alignment with motion is treated as an emergent outcome rather than an explicitly supervised target.

invented entities (2)
  • Motion-Centric Action Frame (no independent evidence)
    purpose: Defines a per-step local coordinate system in which actions are composed before being mapped back to world coordinates.
    Introduced as the core geometric component of the new action head; no independent evidence supplied beyond the training procedure itself.
  • Prototype-based action parameterization (no independent evidence)
    purpose: Represents actions as combinations drawn from a learned set of prototype vectors inside the local frame.
    New representational choice claimed to produce more compact action distributions; no external validation provided.

pith-pipeline@v0.9.0 · 5560 in / 1406 out tokens · 84357 ms · 2026-05-13T06:02:52.197603+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

    Brianna Zitkovich, Tianhe Yu, Sichun Xu, Peng Xu, Ted Xiao, Fei Xia, Jialin Wu, Paul Wohlhart, Stefan Welker, Ayzaan Wahid, et al. Rt-2: Vision-language-action models transfer web knowledge to robotic control. InConference on Robot Learning, pages 2165–2183. PMLR, 2023. Anonymous Code Repository An anonymized code repository for reproducing the experiment...