Improving Vision-Language-Action Model Fine-Tuning with Structured Stage and Keyframe Supervision

Jiabing Yang; Kai Wang; Liang Wang; Peiyan Li; Qisen Ma; Yan Huang; Yixiang Chen; Yuan Xu

arxiv: 2606.26801 · v1 · pith:TZGEQSTWnew · submitted 2026-06-25 · 💻 cs.RO · cs.CV

Improving Vision-Language-Action Model Fine-Tuning with Structured Stage and Keyframe Supervision

Yuan Xu , Yixiang Chen , Kai Wang , Jiabing Yang , Peiyan Li , Qisen Ma , Yan Huang , Liang Wang This is my paper

Pith reviewed 2026-06-26 04:55 UTC · model grok-4.3

classification 💻 cs.RO cs.CV

keywords vision-language-action modelsrobotic manipulationfine-tuningstage supervisionkeyframe predictionauxiliary supervisiongripper eventslong-horizon tasks

0 comments

The pith

Structured stage and keyframe supervision derived automatically from gripper states improves VLA fine-tuning success rates on robotic manipulation tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that uniform action supervision during VLA fine-tuning causes failures to concentrate at gripper-event transitions in long-horizon sequences. It proposes StaKe as a plug-in method that automatically extracts a stage classifier and a keyframe predictor from demonstration gripper recordings to supply auxiliary training signals. These lightweight heads enrich the learned representations while leaving the base policy and inference unchanged. A sympathetic reader would care because the approach yields higher success rates without manual labels or architectural changes, and the gains grow with task length and transition count.

Core claim

StaKe automatically derives two complementary signals from demonstration gripper states without manual annotation: a stage classifier that identifies the current manipulation stage and a keyframe predictor that estimates the target joint action at the next gripper transition. Both are implemented as lightweight auxiliary heads that enrich representations during training. Experiments on bimanual simulation and single-arm Franka real-robot tasks show consistent success-rate improvements with relative gains of 14 percent and 56 percent respectively, with larger benefits on longer-horizon tasks that involve more gripper-event transitions.

What carries the argument

The StaKe plug-in auxiliary heads that supply stage classification and keyframe prediction signals derived from gripper states to structure VLA fine-tuning.

If this is right

Success rates rise by relative 14 percent on bimanual simulation tasks.
Success rates rise by relative 56 percent on single-arm Franka real-robot tasks.
Improvements are larger on longer-horizon tasks that contain more gripper-event transitions.
The base VLA policy architecture and inference loop stay unchanged.
Learned representations track manipulation stages according to qualitative analysis.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same gripper-derived auxiliary signals could be tested on other sequential robot behaviors that lack clear stage boundaries.
Combining StaKe with existing VLA fine-tuning methods might compound gains on complex multi-step tasks.
Evaluating the method on tasks with noisy or inconsistent gripper recordings would test robustness to imperfect demonstration data.

Load-bearing premise

Gripper states recorded in demonstrations can be automatically segmented into meaningful manipulation stages and keyframes that provide useful auxiliary supervision signals without manual annotation or task-specific engineering.

What would settle it

Applying StaKe to a new set of manipulation tasks where gripper state changes do not align with distinct stages and measuring whether success rates increase, remain flat, or decrease.

Figures

Figures reproduced from arXiv: 2606.26801 by Jiabing Yang, Kai Wang, Liang Wang, Peiyan Li, Qisen Ma, Yan Huang, Yixiang Chen, Yuan Xu.

**Figure 1.** Figure 1: Manipulation failures concentrate around gripper-state transitions. A motion-fail occurs when the robot fails to reach and grasp the target during free-space motion, while a skill-fail occurs when the robot grasps successfully but fails during post-grasp skill execution (e.g., handover, stacking, rotation). Pre-trained VLM Action Expert V+L Tokens Noise Actions Denoising Query Stage Head Keyframe Head Pre… view at source ↗

**Figure 2.** Figure 2: Auxiliary supervision introduced by StaKe. Two learnable query tokens and their corresponding heads (highlighted) are added to a VLA model, providing stage classification and keyframe prediction signals during training without altering the action expert or inference loop. uniformly over the entire trajectory [13], [14]. Yet a manipulation trajectory is not homogeneous, but typically consists of several st… view at source ↗

**Figure 3.** Figure 3: Overview of the StaKe framework. Two learnable query tokens are appended to the pre-trained VLM backbone, each feeding a lightweight auxiliary head: a stage classification head trained with binary cross-entropy loss and a keyframe prediction head trained with L1 loss. Both heads are active only during training (marked “Training Only”) and jointly optimized with the flow-matching policy loss. At inference t… view at source ↗

**Figure 4.** Figure 4: Qualitative analysis (Q3). Three tasks of increasing complexity (rows, top to bottom): Lift Pot (2 KF), Handover Block (4 KF), and Stack Blocks Two (5 KF). Left column: task execution RGB sequence. Middle column: ground-truth dual-arm gripper states with annotated stage boundaries (green dashed) and keyframe positions (gray dotted). Right column: per-action-chunk predicted stage label (bar color: blue = Mo… view at source ↗

**Figure 5.** Figure 5: Real-robot experiments (Q4). Upper left: workspace with a Franka Research 3 arm and a front-view ZED 2i camera. Middle: start frame and keyframe snapshots for each task (top to bottom: Mug on Plate, Stack Bowls, Bowl in Drawer, Two Mugs in Basket). Right: cumulative success rate at each keyframe over 10 trials for π0.5 and StaKe. Table I that StaKe yields larger gains on multi-keyframe tasks, as the heads … view at source ↗

read the original abstract

Vision-Language-Action (VLA) models have shown strong potential for generalizable robotic manipulation. During fine-tuning, however, action supervision applies equally across all timesteps, without structured supervision on which manipulation stage the robot is in or what the next gripper-event target should be. This causes failures to concentrate around challenging gripper-event transitions. To address this, we propose StaKe, a plug-in auxiliary supervision framework that automatically derives two complementary signals from demonstration gripper states without manual annotation: a stage classifier that identifies the current manipulation stage, and a keyframe predictor that estimates the target joint action at the next gripper transition. Both are modeled as lightweight auxiliary heads that enrich the learned representations during training, while leaving the base VLA policy architecture and inference loop unchanged. Experiments on bimanual simulation and single-arm Franka real-robot tasks show that StaKe consistently improves success rates (relative gains of 14% and 56%, respectively), with larger improvements on longer-horizon tasks that involve more gripper-event transitions. Ablation studies validate each design choice, and qualitative analysis confirms that the learned representations faithfully track manipulation stages. These results indicate that structured supervision is an effective and general strategy for enhancing VLA fine-tuning in long-horizon manipulation. Project website: https://hi-yuanxu.github.io/StaKe-Web/

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

StaKe adds two auxiliary heads from gripper logs to VLA fine-tuning and reports clear gains on long-horizon tasks without changing inference.

read the letter

The core idea is straightforward. The authors extract manipulation stages and next-gripper-event targets directly from the binary open/close records already present in standard demonstrations. They attach a lightweight stage classifier and a keyframe action predictor as auxiliary heads during fine-tuning, then discard both at test time. This supplies extra gradient signal exactly where uniform action supervision tends to fail.

What stands out is the automatic signal construction. No manual stage labels or task-specific rules are required. The stage head learns to identify current segments from gripper transitions, and the keyframe head predicts the joint configuration at the subsequent transition. Experiments cover bimanual simulation and real single-arm Franka work, with relative success-rate lifts of 14% and 56%. The improvement scales with horizon length and number of gripper events, which matches the stated motivation.

Ablations isolate each head and show both contribute. Qualitative checks confirm the learned features align with the gripper-derived stages. The base VLA architecture and inference loop stay unchanged, so the method is genuinely plug-in.

The main limitation is dependence on gripper state changes as a reliable proxy for meaningful stages. This works in the reported pick-and-place style tasks, but tasks with sustained gripper force or fluid motion may see smaller returns. The gains are reported as relative improvements; absolute numbers and full baseline tables are needed to judge practical impact. No statistical significance tests are highlighted in the summary, though the stress-test indicates the manuscript supplies the supporting details.

The work targets researchers already fine-tuning VLAs for manipulation who want a low-overhead way to add temporal structure. Readers focused on long-horizon robot tasks will see the clearest value. The experiments include both simulation and hardware, the ablations are present, and the construction avoids circularity or hidden fitting.

I would send this to peer review. The idea is clean, the evaluation covers the right axes, and the claims are testable.

Referee Report

0 major / 3 minor

Summary. The manuscript proposes StaKe, a plug-in auxiliary supervision framework for fine-tuning Vision-Language-Action (VLA) models. It automatically derives two signals from demonstration gripper states (binary open/close transitions) without manual annotation or task-specific engineering: (1) a stage classifier identifying the current manipulation stage and (2) a keyframe predictor estimating the target joint action at the next gripper transition. These are implemented as lightweight auxiliary heads that enrich representations during training but are discarded at inference, leaving the base VLA policy unchanged. Experiments on bimanual simulation and single-arm Franka real-robot tasks report relative success-rate gains of 14% and 56%, respectively, with larger improvements on longer-horizon tasks; ablations isolate each head and qualitative analysis shows learned representations align with gripper-event stages.

Significance. If the reported gains and ablations hold under full experimental scrutiny, the work is significant for demonstrating a general, annotation-free method to inject structured stage and keyframe supervision into VLA fine-tuning. The approach directly targets failure modes at gripper transitions in long-horizon tasks, preserves the original inference loop, and requires only lightweight heads, making it broadly applicable. Credit is due for the automatic derivation of supervision signals purely from gripper logs and for including both simulation and real-robot validation with ablations.

minor comments (3)

[Abstract and §4] The abstract and introduction state relative gains of 14% and 56% but do not specify the absolute success rates or the number of trials per condition; adding these numbers (with standard errors) would strengthen the results section.
[§3] The description of how gripper-state transitions are segmented into stages and keyframes is presented at a high level; a short pseudocode or explicit rule set in §3 would improve reproducibility.
[Figures 4-6] Figure captions and axis labels in the qualitative analysis figures should explicitly state which VLA backbone and task are shown to avoid ambiguity when comparing to baselines.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their thorough summary and positive evaluation of StaKe. The recommendation for minor revision is noted; however, the report lists no specific major comments requiring response. We address the overall assessment below and confirm that the manuscript already incorporates the elements highlighted as strengths.

Circularity Check

0 steps flagged

No significant circularity identified

full rationale

The paper presents an empirical method for adding lightweight auxiliary heads (stage classifier and keyframe predictor) whose training signals are extracted directly from recorded binary gripper states in demonstrations. No equations, derivations, or self-citations are used to reduce the reported success-rate gains to quantities defined by construction from the same fitted parameters. The auxiliary heads are discarded at inference, the evaluation uses held-out tasks, and ablations isolate each component. The central claim therefore rests on external experimental validation rather than any self-definitional or fitted-input reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that gripper states alone suffice to define useful stages and keyframes.

axioms (1)

domain assumption Gripper states in demonstrations can be used to automatically derive manipulation stages and keyframes without manual annotation.
Explicitly stated as the source of the two complementary signals.

pith-pipeline@v0.9.1-grok · 5791 in / 1163 out tokens · 27538 ms · 2026-06-26T04:55:01.084613+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

42 extracted references · 19 linked inside Pith

[1]

A survey on vision-language-action models for embodied ai,

Y . Ma, Z. Song, Y . Zhuang, J. Hao, and I. King, “A survey on vision-language-action models for embodied ai,”arXiv preprint arXiv:2405.14093, 2024

Pith/arXiv arXiv 2024
[2]

Openflamingo: An open-source framework for training large autoregressive vision- language models,

A. Awadalla, I. Gao, J. Gardner, J. Hessel, Y . Hanafy, W. Zhu, K. Marathe, Y . Bitton, S. Gadre, S. Sagawa,et al., “Openflamingo: An open-source framework for training large autoregressive vision- language models,”arXiv preprint arXiv:2308.01390, 2023

Pith/arXiv arXiv 2023
[3]

Visual instruction tuning,

H. Liu, C. Li, Q. Wu, and Y . J. Lee, “Visual instruction tuning,” Advances in neural information processing systems, vol. 36, pp. 34 892–34 916, 2023

2023
[4]

Rt-1: Robotics transformer for real-world control at scale,

A. Brohan, N. Brown, J. Carbajal, Y . Chebotar, J. Dabis, C. Finn, K. Gopalakrishnan, K. Hausman, A. Herzog, J. Hsu,et al., “Rt-1: Robotics transformer for real-world control at scale,”arXiv preprint arXiv:2212.06817, 2022

Pith/arXiv arXiv 2022
[5]

Rt-2: Vision- language-action models transfer web knowledge to robotic control, 2023,

A. Brohan, N. Brown, J. Carbajal, Y . Chebotar, X. Chen, K. Choro- manski, T. Ding, D. Driess, A. Dubey, C. Finn,et al., “Rt-2: Vision- language-action models transfer web knowledge to robotic control, 2023,”URL https://arxiv. org/abs/2307.15818, vol. 1, p. 2, 2024

Pith/arXiv arXiv 2023
[6]

Palm-e: An embodied multimodal language model,

D. Driess, F. Xia, M. S. Sajjadi, C. Lynch, A. Chowdhery, B. Ichter, A. Wahid, J. Tompson, Q. Vuong, T. Yu,et al., “Palm-e: An embodied multimodal language model,”arXiv preprint arXiv:2303.03378, 2023

Pith/arXiv arXiv 2023
[7]

Openvla: An open-source vision-language-action model, 2024,

M. J. Kim, K. Pertsch, S. Karamcheti, T. Xiao, A. Balakrishna, S. Nair, R. Rafailov, E. Foster, G. Lam, P. Sanketi,et al., “Openvla: An open-source vision-language-action model, 2024,”URL https://arxiv. org/abs/2406.09246, vol. 1, no. 2, p. 4, 2024

Pith/arXiv arXiv 2024
[8]

Octo: An open-source generalist robot policy,

O. M. Team, D. Ghosh, H. Walke, K. Pertsch, K. Black, O. Mees, S. Dasari, J. Hejna, T. Kreiman, C. Xu,et al., “Octo: An open-source generalist robot policy,”arXiv preprint arXiv:2405.12213, 2024

Pith/arXiv arXiv 2024
[9]

Open x-embodiment: Robotic learning datasets and rt-x models,

Q. Vuong, S. Levine, H. R. Walke, K. Pertsch, A. Singh, R. Doshi, C. Xu, J. Luo, L. Tan, D. Shah,et al., “Open x-embodiment: Robotic learning datasets and rt-x models,” 2023

2023
[10]

Rh20t: A comprehensive robotic dataset for learning diverse skills in one-shot,

H.-S. Fang, H. Fang, Z. Tang, J. Liu, C. Wang, J. Wang, H. Zhu, and C. Lu, “Rh20t: A comprehensive robotic dataset for learning diverse skills in one-shot,”arXiv preprint arXiv:2307.00595, 2023

arXiv 2023
[11]

π 0: A Vision- Language-Action Flow Model for General Robot Control,

K. Black, N. Brown, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, L. Groom, K. Hausman, B. Ichter,et al., “π 0: A Vision- Language-Action Flow Model for General Robot Control,”arXiv preprint arXiv:2410.24164, 2024

Pith/arXiv arXiv 2024
[12]

π 0.5: A Vision-Language-Action Model with Open-World Generalization,

P. Intelligence, K. Black, N. Brown, J. Darpinian, K. Dhabalia, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai,et al., “π 0.5: A Vision-Language-Action Model with Open-World Generalization,” arXiv preprint arXiv:2504.16054, 2025

Pith/arXiv arXiv 2025
[13]

Fine-tuning vision-language- action models: Optimizing speed and success,

M. J. Kim, C. Finn, and P. Liang, “Fine-tuning vision-language- action models: Optimizing speed and success,”arXiv preprint arXiv:2502.19645, 2025

Pith/arXiv arXiv 2025
[14]

Roboagent: Generalization and efficiency in robot manipulation via semantic augmentations and action chunking,

H. Bharadhwaj, J. Vakil, M. Sharma, A. Gupta, S. Tulsiani, and V . Ku- mar, “Roboagent: Generalization and efficiency in robot manipulation via semantic augmentations and action chunking,” pp. 4788–4795, 2024

2024
[15]

Robotwin 2.0: A scalable data generator and benchmark with strong domain randomization for robust bimanual robotic manipulation,

T. Chen, Z. Chen, B. Chen, Z. Cai, Y . Liu, Z. Li, Q. Liang, X. Lin, Y . Ge, Z. Gu,et al., “Robotwin 2.0: A scalable data generator and benchmark with strong domain randomization for robust bimanual robotic manipulation,”arXiv preprint arXiv:2506.18088, 2025

Pith/arXiv arXiv 2025
[16]

Vision-language foundation models as effective robot imitators,

X. Li, M. Liu, H. Zhang, C. Yu, J. Xu, H. Wu, C. Cheang, Y . Jing, W. Zhang, H. Liu,et al., “Vision-language foundation models as effective robot imitators,” 2023

2023
[17]

Cogact: A foundational vision- language-action model for synergizing cognition and action in robotic manipulation,

Q. Li, Y . Liang, Z. Wang, L. Luo, X. Chen, M. Liao, F. Wei, Y . Deng, S. Xu, Y . Zhang,et al., “Cogact: A foundational vision- language-action model for synergizing cognition and action in robotic manipulation,”arXiv preprint arXiv:2411.19650, 2024

Pith/arXiv arXiv 2024
[18]

Rdt-1b: a diffusion foundation model for bimanual manipulation,

S. Liu, L. Wu, B. Li, H. Tan, H. Chen, Z. Wang, K. Xu, H. Su, and J. Zhu, “Rdt-1b: a diffusion foundation model for bimanual manipulation,”arXiv preprint arXiv:2410.07864, 2024

Pith/arXiv arXiv 2024
[19]

Gr-2: A generative video-language- action model with web-scale knowledge for robot manipulation,

C.-L. Cheang, G. Chen, Y . Jing, T. Kong, H. Li, Y . Li, Y . Liu, H. Wu, J. Xu, Y . Yang,et al., “Gr-2: A generative video-language- action model with web-scale knowledge for robot manipulation,”arXiv preprint arXiv:2410.06158, 2024

Pith/arXiv arXiv 2024
[20]

Scaling proprioceptive-visual learning with heterogeneous pre-trained transformers,

L. Wang, X. Chen, J. Zhao, and K. He, “Scaling proprioceptive-visual learning with heterogeneous pre-trained transformers,” vol. 37, 2024, pp. 124 420–124 450

2024
[21]

Spatialvla: Exploring spatial representations for visual-language-action model,

D. Qu, H. Song, Q. Chen, Y . Yao, X. Ye, Y . Ding, Z. Wang, J. Gu, B. Zhao, D. Wang,et al., “Spatialvla: Exploring spatial representations for visual-language-action model,”arXiv preprint arXiv:2501.15830, 2025

Pith/arXiv arXiv 2025
[22]

Hybridvla: Collaborative diffusion and au- toregression in a unified vision-language-action model,

J. Liu, H. Chen, P. An, Z. Liu, R. Zhang, C. Gu, X. Li, Z. Guo, S. Chen, M. Liu,et al., “Hybridvla: Collaborative diffusion and au- toregression in a unified vision-language-action model,”arXiv preprint arXiv:2503.10631, 2025

Pith/arXiv arXiv 2025
[23]

Fast: Efficient action tokenization for vision-language-action models,

K. Pertsch, K. Stachowicz, B. Ichter, D. Driess, S. Nair, Q. Vuong, O. Mees, C. Finn, and S. Levine, “Fast: Efficient action tokenization for vision-language-action models,”arXiv preprint arXiv:2501.09747, 2025

Pith/arXiv arXiv 2025
[24]

Tinyvla: Towards fast, data-efficient vision- language-action models for robotic manipulation,

J. Wen, Y . Zhu, J. Li, M. Zhu, Z. Tang, K. Wu, Z. Xu, N. Liu, R. Cheng, C. Shen,et al., “Tinyvla: Towards fast, data-efficient vision- language-action models for robotic manipulation,”IEEE Robotics and Automation Letters, 2025

2025
[25]

Deer-vla: Dynamic inference of multimodal large language models for efficient robot execution,

Y . Yue, Y . Wang, B. Kang, Y . Han, S. Wang, S. Song, J. Feng, and G. Huang, “Deer-vla: Dynamic inference of multimodal large language models for efficient robot execution,” vol. 37, 2024, pp. 56 619–56 643

2024
[26]

Conrft: A reinforced fine-tuning method for vla models via consistency policy,

Y . Chen, S. Tian, S. Liu, Y . Zhou, H. Li, and D. Zhao, “Conrft: A reinforced fine-tuning method for vla models via consistency policy,” arXiv preprint arXiv:2502.05450, 2025

arXiv 2025
[27]

Improving vision-language-action model with online reinforcement learning,

Y . Guo, J. Zhang, X. Chen, X. Ji, Y .-J. Wang, Y . Hu, and J. Chen, “Improving vision-language-action model with online reinforcement learning,” pp. 15 665–15 672, 2025

2025
[28]

Perceiver-actor: A multi-task transformer for robotic manipulation,

M. Shridhar, L. Manuelli, and D. Fox, “Perceiver-actor: A multi-task transformer for robotic manipulation,” pp. 785–799, 2023

2023
[29]

Rvt: Robotic view transformer for 3d object manipulation,

A. Goyal, J. Xu, Y . Guo, V . Blukis, Y .-W. Chao, and D. Fox, “Rvt: Robotic view transformer for 3d object manipulation,” pp. 694–710, 2023

2023
[30]

Act3d: 3d feature field transformers for multi-task robotic manipulation,

T. Gervet, Z. Xian, N. Gkanatsios, and K. Fragkiadaki, “Act3d: 3d feature field transformers for multi-task robotic manipulation,” 2023

2023
[31]

Coarse-to-fine 3d keyframe transporter,

X. Zhu, D. Klee, D. Wang, B. Hu, H. Huang, A. Tangri, R. Walters, and R. Platt, “Coarse-to-fine 3d keyframe transporter,”arXiv preprint arXiv:2502.01773, 2025

arXiv 2025
[32]

Waypoint- based imitation learning for robotic manipulation,

L. X. Shi, A. Sharma, T. Z. Zhao, and C. Finn, “Waypoint- based imitation learning for robotic manipulation,”arXiv preprint arXiv:2307.14326, 2023

arXiv 2023
[33]

Pivot-r: Primitive-driven waypoint-aware world model for robotic manipulation,

K. Zhang, P. Ren, B. Lin, J. Lin, S. Ma, H. Xu, and X. Liang, “Pivot-r: Primitive-driven waypoint-aware world model for robotic manipulation,” vol. 37, 2024, pp. 54 105–54 136

2024
[34]

Keypoint action tokens enable in-context imitation learning in robotics,

N. Di Palo and E. Johns, “Keypoint action tokens enable in-context imitation learning in robotics,”arXiv preprint arXiv:2403.19578, 2024

arXiv 2024
[35]

Spatio- temporal reasoning of relational keypoint constraints for robotic ma- nipulation,

W. Huang, C. Wang, Y . Li, R. Zhang, and L. R. Fei-Fei, “Spatio- temporal reasoning of relational keypoint constraints for robotic ma- nipulation,”arXiv preprint arXiv:2409.01652, vol. 2, 2024

Pith/arXiv arXiv 2024
[36]

Kisa: A unified keyframe identifier and skill annotator for long-horizon robotics demonstrations,

L. Kou, F. Ni, Y . Zheng, J. Liu, Y . Yuan, Z. Dong, and J. Hao, “Kisa: A unified keyframe identifier and skill annotator for long-horizon robotics demonstrations,” inForty-first International Conference on Machine Learning, 2024

2024
[37]

Prime: Scaffolding manipulation tasks with behavior primitives for data-efficient imitation learning,

T. Gao, S. Nasiriany, H. Liu, Q. Yang, and Y . Zhu, “Prime: Scaffolding manipulation tasks with behavior primitives for data-efficient imitation learning,”IEEE Robotics and Automation Letters, vol. 9, no. 10, pp. 8322–8329, 2024

2024
[38]

Learning semantic atomic skills for multi-task robotic manipulation,

Y . Zhu, W. Wang, S. Wu, Y . Shi, and J. Wang, “Learning semantic atomic skills for multi-task robotic manipulation,”arXiv preprint arXiv:2512.18368, 2025

arXiv 2025
[39]

Dexskills: Skill segmentation using haptic data for learning autonomous long-horizon robotic manipulation tasks,

X. Mao, G. Giudici, C. Coppola, K. Althoefer, I. Farkhatdinov, Z. Li, and L. Jamone, “Dexskills: Skill segmentation using haptic data for learning autonomous long-horizon robotic manipulation tasks,” in 2024 IEEE/RSJ international conference on intelligent robots and systems (IROS). IEEE, 2024, pp. 5104–5111

2024
[40]

Contrastive imitation learning for language-guided multi-task robotic manipulation,

T. Ma, J. Zhou, Z. Wang, R. Qiu, and J. Liang, “Contrastive imitation learning for language-guided multi-task robotic manipulation,”arXiv preprint arXiv:2406.09738, 2024

arXiv 2024
[41]

Diffusion policy: Visuomotor policy learning via action diffusion,

C. Chi, Z. Xu, S. Feng, E. Cousineau, Y . Du, B. Burchfiel, R. Tedrake, and S. Song, “Diffusion policy: Visuomotor policy learning via action diffusion,”The International Journal of Robotics Research, vol. 44, no. 10-11, pp. 1684–1704, 2025

2025
[42]

Learning fine-grained bimanual manipulation with low-cost hardware,

T. Z. Zhao, V . Kumar, S. Levine, and C. Finn, “Learning fine-grained bimanual manipulation with low-cost hardware,”arXiv preprint arXiv:2304.13705, 2023

Pith/arXiv arXiv 2023

[1] [1]

A survey on vision-language-action models for embodied ai,

Y . Ma, Z. Song, Y . Zhuang, J. Hao, and I. King, “A survey on vision-language-action models for embodied ai,”arXiv preprint arXiv:2405.14093, 2024

Pith/arXiv arXiv 2024

[2] [2]

Openflamingo: An open-source framework for training large autoregressive vision- language models,

A. Awadalla, I. Gao, J. Gardner, J. Hessel, Y . Hanafy, W. Zhu, K. Marathe, Y . Bitton, S. Gadre, S. Sagawa,et al., “Openflamingo: An open-source framework for training large autoregressive vision- language models,”arXiv preprint arXiv:2308.01390, 2023

Pith/arXiv arXiv 2023

[3] [3]

Visual instruction tuning,

H. Liu, C. Li, Q. Wu, and Y . J. Lee, “Visual instruction tuning,” Advances in neural information processing systems, vol. 36, pp. 34 892–34 916, 2023

2023

[4] [4]

Rt-1: Robotics transformer for real-world control at scale,

A. Brohan, N. Brown, J. Carbajal, Y . Chebotar, J. Dabis, C. Finn, K. Gopalakrishnan, K. Hausman, A. Herzog, J. Hsu,et al., “Rt-1: Robotics transformer for real-world control at scale,”arXiv preprint arXiv:2212.06817, 2022

Pith/arXiv arXiv 2022

[5] [5]

Rt-2: Vision- language-action models transfer web knowledge to robotic control, 2023,

A. Brohan, N. Brown, J. Carbajal, Y . Chebotar, X. Chen, K. Choro- manski, T. Ding, D. Driess, A. Dubey, C. Finn,et al., “Rt-2: Vision- language-action models transfer web knowledge to robotic control, 2023,”URL https://arxiv. org/abs/2307.15818, vol. 1, p. 2, 2024

Pith/arXiv arXiv 2023

[6] [6]

Palm-e: An embodied multimodal language model,

D. Driess, F. Xia, M. S. Sajjadi, C. Lynch, A. Chowdhery, B. Ichter, A. Wahid, J. Tompson, Q. Vuong, T. Yu,et al., “Palm-e: An embodied multimodal language model,”arXiv preprint arXiv:2303.03378, 2023

Pith/arXiv arXiv 2023

[7] [7]

Openvla: An open-source vision-language-action model, 2024,

M. J. Kim, K. Pertsch, S. Karamcheti, T. Xiao, A. Balakrishna, S. Nair, R. Rafailov, E. Foster, G. Lam, P. Sanketi,et al., “Openvla: An open-source vision-language-action model, 2024,”URL https://arxiv. org/abs/2406.09246, vol. 1, no. 2, p. 4, 2024

Pith/arXiv arXiv 2024

[8] [8]

Octo: An open-source generalist robot policy,

O. M. Team, D. Ghosh, H. Walke, K. Pertsch, K. Black, O. Mees, S. Dasari, J. Hejna, T. Kreiman, C. Xu,et al., “Octo: An open-source generalist robot policy,”arXiv preprint arXiv:2405.12213, 2024

Pith/arXiv arXiv 2024

[9] [9]

Open x-embodiment: Robotic learning datasets and rt-x models,

Q. Vuong, S. Levine, H. R. Walke, K. Pertsch, A. Singh, R. Doshi, C. Xu, J. Luo, L. Tan, D. Shah,et al., “Open x-embodiment: Robotic learning datasets and rt-x models,” 2023

2023

[10] [10]

Rh20t: A comprehensive robotic dataset for learning diverse skills in one-shot,

H.-S. Fang, H. Fang, Z. Tang, J. Liu, C. Wang, J. Wang, H. Zhu, and C. Lu, “Rh20t: A comprehensive robotic dataset for learning diverse skills in one-shot,”arXiv preprint arXiv:2307.00595, 2023

arXiv 2023

[11] [11]

π 0: A Vision- Language-Action Flow Model for General Robot Control,

K. Black, N. Brown, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, L. Groom, K. Hausman, B. Ichter,et al., “π 0: A Vision- Language-Action Flow Model for General Robot Control,”arXiv preprint arXiv:2410.24164, 2024

Pith/arXiv arXiv 2024

[12] [12]

π 0.5: A Vision-Language-Action Model with Open-World Generalization,

P. Intelligence, K. Black, N. Brown, J. Darpinian, K. Dhabalia, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai,et al., “π 0.5: A Vision-Language-Action Model with Open-World Generalization,” arXiv preprint arXiv:2504.16054, 2025

Pith/arXiv arXiv 2025

[13] [13]

Fine-tuning vision-language- action models: Optimizing speed and success,

M. J. Kim, C. Finn, and P. Liang, “Fine-tuning vision-language- action models: Optimizing speed and success,”arXiv preprint arXiv:2502.19645, 2025

Pith/arXiv arXiv 2025

[14] [14]

Roboagent: Generalization and efficiency in robot manipulation via semantic augmentations and action chunking,

H. Bharadhwaj, J. Vakil, M. Sharma, A. Gupta, S. Tulsiani, and V . Ku- mar, “Roboagent: Generalization and efficiency in robot manipulation via semantic augmentations and action chunking,” pp. 4788–4795, 2024

2024

[15] [15]

Robotwin 2.0: A scalable data generator and benchmark with strong domain randomization for robust bimanual robotic manipulation,

T. Chen, Z. Chen, B. Chen, Z. Cai, Y . Liu, Z. Li, Q. Liang, X. Lin, Y . Ge, Z. Gu,et al., “Robotwin 2.0: A scalable data generator and benchmark with strong domain randomization for robust bimanual robotic manipulation,”arXiv preprint arXiv:2506.18088, 2025

Pith/arXiv arXiv 2025

[16] [16]

Vision-language foundation models as effective robot imitators,

X. Li, M. Liu, H. Zhang, C. Yu, J. Xu, H. Wu, C. Cheang, Y . Jing, W. Zhang, H. Liu,et al., “Vision-language foundation models as effective robot imitators,” 2023

2023

[17] [17]

Cogact: A foundational vision- language-action model for synergizing cognition and action in robotic manipulation,

Q. Li, Y . Liang, Z. Wang, L. Luo, X. Chen, M. Liao, F. Wei, Y . Deng, S. Xu, Y . Zhang,et al., “Cogact: A foundational vision- language-action model for synergizing cognition and action in robotic manipulation,”arXiv preprint arXiv:2411.19650, 2024

Pith/arXiv arXiv 2024

[18] [18]

Rdt-1b: a diffusion foundation model for bimanual manipulation,

S. Liu, L. Wu, B. Li, H. Tan, H. Chen, Z. Wang, K. Xu, H. Su, and J. Zhu, “Rdt-1b: a diffusion foundation model for bimanual manipulation,”arXiv preprint arXiv:2410.07864, 2024

Pith/arXiv arXiv 2024

[19] [19]

Gr-2: A generative video-language- action model with web-scale knowledge for robot manipulation,

C.-L. Cheang, G. Chen, Y . Jing, T. Kong, H. Li, Y . Li, Y . Liu, H. Wu, J. Xu, Y . Yang,et al., “Gr-2: A generative video-language- action model with web-scale knowledge for robot manipulation,”arXiv preprint arXiv:2410.06158, 2024

Pith/arXiv arXiv 2024

[20] [20]

Scaling proprioceptive-visual learning with heterogeneous pre-trained transformers,

L. Wang, X. Chen, J. Zhao, and K. He, “Scaling proprioceptive-visual learning with heterogeneous pre-trained transformers,” vol. 37, 2024, pp. 124 420–124 450

2024

[21] [21]

Spatialvla: Exploring spatial representations for visual-language-action model,

D. Qu, H. Song, Q. Chen, Y . Yao, X. Ye, Y . Ding, Z. Wang, J. Gu, B. Zhao, D. Wang,et al., “Spatialvla: Exploring spatial representations for visual-language-action model,”arXiv preprint arXiv:2501.15830, 2025

Pith/arXiv arXiv 2025

[22] [22]

Hybridvla: Collaborative diffusion and au- toregression in a unified vision-language-action model,

J. Liu, H. Chen, P. An, Z. Liu, R. Zhang, C. Gu, X. Li, Z. Guo, S. Chen, M. Liu,et al., “Hybridvla: Collaborative diffusion and au- toregression in a unified vision-language-action model,”arXiv preprint arXiv:2503.10631, 2025

Pith/arXiv arXiv 2025

[23] [23]

Fast: Efficient action tokenization for vision-language-action models,

K. Pertsch, K. Stachowicz, B. Ichter, D. Driess, S. Nair, Q. Vuong, O. Mees, C. Finn, and S. Levine, “Fast: Efficient action tokenization for vision-language-action models,”arXiv preprint arXiv:2501.09747, 2025

Pith/arXiv arXiv 2025

[24] [24]

Tinyvla: Towards fast, data-efficient vision- language-action models for robotic manipulation,

J. Wen, Y . Zhu, J. Li, M. Zhu, Z. Tang, K. Wu, Z. Xu, N. Liu, R. Cheng, C. Shen,et al., “Tinyvla: Towards fast, data-efficient vision- language-action models for robotic manipulation,”IEEE Robotics and Automation Letters, 2025

2025

[25] [25]

Deer-vla: Dynamic inference of multimodal large language models for efficient robot execution,

Y . Yue, Y . Wang, B. Kang, Y . Han, S. Wang, S. Song, J. Feng, and G. Huang, “Deer-vla: Dynamic inference of multimodal large language models for efficient robot execution,” vol. 37, 2024, pp. 56 619–56 643

2024

[26] [26]

Conrft: A reinforced fine-tuning method for vla models via consistency policy,

Y . Chen, S. Tian, S. Liu, Y . Zhou, H. Li, and D. Zhao, “Conrft: A reinforced fine-tuning method for vla models via consistency policy,” arXiv preprint arXiv:2502.05450, 2025

arXiv 2025

[27] [27]

Improving vision-language-action model with online reinforcement learning,

Y . Guo, J. Zhang, X. Chen, X. Ji, Y .-J. Wang, Y . Hu, and J. Chen, “Improving vision-language-action model with online reinforcement learning,” pp. 15 665–15 672, 2025

2025

[28] [28]

Perceiver-actor: A multi-task transformer for robotic manipulation,

M. Shridhar, L. Manuelli, and D. Fox, “Perceiver-actor: A multi-task transformer for robotic manipulation,” pp. 785–799, 2023

2023

[29] [29]

Rvt: Robotic view transformer for 3d object manipulation,

A. Goyal, J. Xu, Y . Guo, V . Blukis, Y .-W. Chao, and D. Fox, “Rvt: Robotic view transformer for 3d object manipulation,” pp. 694–710, 2023

2023

[30] [30]

Act3d: 3d feature field transformers for multi-task robotic manipulation,

T. Gervet, Z. Xian, N. Gkanatsios, and K. Fragkiadaki, “Act3d: 3d feature field transformers for multi-task robotic manipulation,” 2023

2023

[31] [31]

Coarse-to-fine 3d keyframe transporter,

X. Zhu, D. Klee, D. Wang, B. Hu, H. Huang, A. Tangri, R. Walters, and R. Platt, “Coarse-to-fine 3d keyframe transporter,”arXiv preprint arXiv:2502.01773, 2025

arXiv 2025

[32] [32]

Waypoint- based imitation learning for robotic manipulation,

L. X. Shi, A. Sharma, T. Z. Zhao, and C. Finn, “Waypoint- based imitation learning for robotic manipulation,”arXiv preprint arXiv:2307.14326, 2023

arXiv 2023

[33] [33]

Pivot-r: Primitive-driven waypoint-aware world model for robotic manipulation,

K. Zhang, P. Ren, B. Lin, J. Lin, S. Ma, H. Xu, and X. Liang, “Pivot-r: Primitive-driven waypoint-aware world model for robotic manipulation,” vol. 37, 2024, pp. 54 105–54 136

2024

[34] [34]

Keypoint action tokens enable in-context imitation learning in robotics,

N. Di Palo and E. Johns, “Keypoint action tokens enable in-context imitation learning in robotics,”arXiv preprint arXiv:2403.19578, 2024

arXiv 2024

[35] [35]

Spatio- temporal reasoning of relational keypoint constraints for robotic ma- nipulation,

W. Huang, C. Wang, Y . Li, R. Zhang, and L. R. Fei-Fei, “Spatio- temporal reasoning of relational keypoint constraints for robotic ma- nipulation,”arXiv preprint arXiv:2409.01652, vol. 2, 2024

Pith/arXiv arXiv 2024

[36] [36]

Kisa: A unified keyframe identifier and skill annotator for long-horizon robotics demonstrations,

L. Kou, F. Ni, Y . Zheng, J. Liu, Y . Yuan, Z. Dong, and J. Hao, “Kisa: A unified keyframe identifier and skill annotator for long-horizon robotics demonstrations,” inForty-first International Conference on Machine Learning, 2024

2024

[37] [37]

Prime: Scaffolding manipulation tasks with behavior primitives for data-efficient imitation learning,

T. Gao, S. Nasiriany, H. Liu, Q. Yang, and Y . Zhu, “Prime: Scaffolding manipulation tasks with behavior primitives for data-efficient imitation learning,”IEEE Robotics and Automation Letters, vol. 9, no. 10, pp. 8322–8329, 2024

2024

[38] [38]

Learning semantic atomic skills for multi-task robotic manipulation,

Y . Zhu, W. Wang, S. Wu, Y . Shi, and J. Wang, “Learning semantic atomic skills for multi-task robotic manipulation,”arXiv preprint arXiv:2512.18368, 2025

arXiv 2025

[39] [39]

Dexskills: Skill segmentation using haptic data for learning autonomous long-horizon robotic manipulation tasks,

X. Mao, G. Giudici, C. Coppola, K. Althoefer, I. Farkhatdinov, Z. Li, and L. Jamone, “Dexskills: Skill segmentation using haptic data for learning autonomous long-horizon robotic manipulation tasks,” in 2024 IEEE/RSJ international conference on intelligent robots and systems (IROS). IEEE, 2024, pp. 5104–5111

2024

[40] [40]

Contrastive imitation learning for language-guided multi-task robotic manipulation,

T. Ma, J. Zhou, Z. Wang, R. Qiu, and J. Liang, “Contrastive imitation learning for language-guided multi-task robotic manipulation,”arXiv preprint arXiv:2406.09738, 2024

arXiv 2024

[41] [41]

Diffusion policy: Visuomotor policy learning via action diffusion,

C. Chi, Z. Xu, S. Feng, E. Cousineau, Y . Du, B. Burchfiel, R. Tedrake, and S. Song, “Diffusion policy: Visuomotor policy learning via action diffusion,”The International Journal of Robotics Research, vol. 44, no. 10-11, pp. 1684–1704, 2025

2025

[42] [42]

Learning fine-grained bimanual manipulation with low-cost hardware,

T. Z. Zhao, V . Kumar, S. Levine, and C. Finn, “Learning fine-grained bimanual manipulation with low-cost hardware,”arXiv preprint arXiv:2304.13705, 2023

Pith/arXiv arXiv 2023