pith. machine review for the scientific record.

arxiv: 2605.12369 · v1 · submitted 2026-05-12 · 💻 cs.RO

Recognition: 2 theorem links

· Lean Theorem

GuidedVLA: Specifying Task-Relevant Factors via Plug-and-Play Action Attention Specialization

Authors on Pith · no claims yet

Pith reviewed 2026-05-13 03:47 UTC · model grok-4.3

classification 💻 cs.RO
keywords vision-language-action models · robot learning · attention specialization · auxiliary supervision · task generalization · action decoder · plug-and-play

The pith

GuidedVLA improves robot task success by supervising individual attention heads in the action decoder with manually defined auxiliary signals, so that each head focuses on a specific task-relevant factor.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces GuidedVLA to help vision-language-action models generalize better by avoiding overfitting to spurious correlations such as visual shortcuts. It treats the action decoder as an assembly of functional components in which each attention head is supervised by an auxiliary signal to capture a distinct factor: object grounding, spatial geometry, or temporal skill logic. This explicit guidance yields higher success rates than standard VLA baselines in both simulation and real-robot experiments, for in-domain and out-of-domain tasks alike. Sympathetic readers would care because it suggests a way to build more robust robot learning systems without relying solely on end-to-end implicit learning. The paper also reports that the quality of these specialized factors correlates with performance and that the resulting features are decoupled.

Core claim

GuidedVLA guides action generation in VLA models by supervising individual attention heads with manually defined auxiliary signals that capture distinct task-relevant factors: object grounding, spatial geometry, and temporal skill logic. This yields improved success rates across simulation and real-robot experiments in both in-domain and out-of-domain settings, with the specialized factors producing decoupled, high-quality features whose quality correlates positively with task performance.

What carries the argument

Plug-and-play action attention specialization, where individual attention heads are supervised by auxiliary signals to capture distinct task factors without interfering with the main action objective.
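To make that machinery concrete, here is a minimal PyTorch-style sketch of the pattern as Pith reads it: one designated head inside a multi-head attention layer exposes its attention map, and that map is additionally supervised against an object-relevance mask while the remaining heads are trained only by the action objective. The names (SpecializedMHA, object_head_idx, aux_object_loss) and the KL-style loss are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SpecializedMHA(nn.Module):
    """Multi-head self-attention with one head reserved for object grounding.

    Hypothetical sketch: the paper's actual integration (ControlNet-style adapters,
    loss weighting schedule) is not reproduced here.
    """

    def __init__(self, dim: int, num_heads: int, object_head_idx: int = 0):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.object_head_idx = object_head_idx
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor):
        # x: (batch, tokens, dim)
        b, t, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)

        def split(z):  # (batch, tokens, dim) -> (batch, heads, tokens, head_dim)
            return z.view(b, t, self.num_heads, self.head_dim).transpose(1, 2)

        q, k, v = split(q), split(k), split(v)
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.head_dim ** 0.5, dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(b, t, d)
        # expose the designated head's attention map for auxiliary supervision
        object_attn = attn[:, self.object_head_idx]  # (batch, tokens, tokens)
        return self.proj(out), object_attn


def aux_object_loss(object_attn: torch.Tensor, object_mask: torch.Tensor) -> torch.Tensor:
    """Encourage the object head to put its attention mass on task-relevant tokens.

    object_mask: (batch, tokens) binary mask over visual tokens (1 = task-relevant).
    Assumes each sample has at least one relevant token.
    """
    per_token_attn = object_attn.mean(dim=1)  # average over queries -> (batch, tokens)
    target = object_mask / object_mask.sum(dim=-1, keepdim=True).clamp_min(1.0)
    return F.kl_div(per_token_attn.clamp_min(1e-8).log(), target, reduction="batchmean")
```

In training, this auxiliary term would be added to the action loss with a small weight, nudging the designated head toward task-relevant regions without rewriting the main objective.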

If this is right

  • Explicit supervision of attention heads reduces overfitting to environmental noise and visual shortcuts.
  • Decoupled features from specialized heads improve generalization to new environments.
  • The quality of auxiliary-guided factors directly impacts overall task success.
  • Action decoders can be designed as modular assemblies rather than monolithic learners.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Similar specialization could be applied to other modalities in multimodal models beyond robotics.
  • Automating the definition of auxiliary signals might reduce the manual effort required.
  • Testing on more complex tasks could reveal limits of the three-head setup.
  • Integration with other VLA improvements might compound the benefits.

Load-bearing premise

That manually defined auxiliary signals can be supplied to individual attention heads to capture distinct factors without the heads interfering with one another or the main action objective.
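Figure 3's caption describes the mechanism the paper leans on for this premise: keep the pretrained attention path as the behavior-preserving branch and fuse a factor-specific branch through a zero-initialized projection, so the adapted module starts out identical to the base policy. Below is a hedged sketch of that ControlNet-style pattern; the class and argument names are illustrative, and the exact fusion point in the authors' decoder may differ.

```python
import copy

import torch
import torch.nn as nn


class ZeroInitResidualAdapter(nn.Module):
    """ControlNet-style residual adapter around a pretrained attention block.

    At initialization the zero projection makes the adapter a no-op, so the adapted
    module reproduces the base policy exactly; task-relevant biases are injected
    gradually as the projection weights move away from zero. Sketch only.
    """

    def __init__(self, base_attn: nn.Module, dim: int):
        super().__init__()
        self.base_attn = base_attn                   # behavior-preserving path (may be frozen)
        self.factor_attn = copy.deepcopy(base_attn)  # factor-specific branch, copied from base weights
        self.zero_proj = nn.Linear(dim, dim)
        nn.init.zeros_(self.zero_proj.weight)        # zero-initialized fusion
        nn.init.zeros_(self.zero_proj.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # assumes base_attn maps (batch, tokens, dim) -> (batch, tokens, dim)
        main = self.base_attn(x)
        specialized = self.factor_attn(x)
        return main + self.zero_proj(specialized)  # identical to the base policy at step 0
```

Because the fusion projection starts at zero, auxiliary gradients reaching the factor branch cannot perturb the pretrained behavior at initialization, which is the sense in which the specialization is plug-and-play.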

What would settle it

An experiment in which adding the specialized heads with auxiliary signals yields no improvement, or a decrease, in success rates relative to the baseline VLA model.
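As a rough rendering of what that decisive comparison could look like in practice, here is a small sketch; the trial counts are placeholders and the one-sided Fisher exact test is an editorial choice, not the paper's evaluation protocol.

```python
from scipy.stats import fisher_exact


def compare_success(baseline_successes: int, baseline_trials: int,
                    guided_successes: int, guided_trials: int) -> None:
    """Test whether adding the specialized heads actually changed the success rate.

    One-sided Fisher exact test on the 2x2 success/failure table: if the guided model
    shows no significant improvement (or does worse), the core claim is undermined.
    """
    table = [
        [guided_successes, guided_trials - guided_successes],
        [baseline_successes, baseline_trials - baseline_successes],
    ]
    _, p_value = fisher_exact(table, alternative="greater")
    print(f"guided:   {guided_successes}/{guided_trials}")
    print(f"baseline: {baseline_successes}/{baseline_trials}")
    print(f"one-sided p-value (guided > baseline): {p_value:.4f}")


# hypothetical counts, for illustration only
compare_success(baseline_successes=31, baseline_trials=50,
                guided_successes=45, guided_trials=50)
```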

Figures

Figures reproduced from arXiv: 2605.12369 by Bowen Yang, Chao Jing, Chao Wu, Chenhe Zhang, Cunxin Fan, Haidong Cao, Hongyang Li, Junchi Yan, Qifeng Li, Qingwen Bu, Xian Nie, Xiaosong Jia, Yilin Chai, Yuchen Zhou, Yufeng Li, Yu-Gang Jiang, Zhenjie Yang, Zijian Liang, Zuhao Ge, Zuxuan Wu.

Figure 1: We present GuidedVLA, a VLA paradigm in which the action decoder is explicitly guided to capture task-relevant information such as object grounding, spatial geometry, and temporal skill logic. Across simulation and real-robot experiments, GuidedVLA significantly improves success rates in both in-domain and out-of-domain settings, demonstrating the effectiveness of specifying action-decoder attention heads … view at source ↗
Figure 2: Architecture of GuidedVLA. We introduce explicit, structured guidance into the multi-head attention layers of the VLA action decoder. Instead of relying on implicitly entangled representations, we repurpose dedicated attention heads to specialize in distinct task-relevant factors: (i) Object Head supervises its attention maps to explicitly ground task-relevant objects and suppress distractors via L_object; … view at source ↗
Figure 3: ControlNet-style residual adapter for plug-and-play head specialization. The pretrained main attention branch is kept as the behavior-preserving path, while a factor-specific attention branch is fused through a zero-initialized projection. The adapter copies weights from the base policy and gradually injects task-relevant biases during training. … view at source ↗
Figure 4: Automatic factor annotation pipeline. Object masks are initialized by Qwen3-VL point prompts and propagated by SAM2, skill labels are generated by Qwen3-VL from stage descriptions and a predefined skill list, and depth guidance uses frozen depth features without requiring depth labels. The pipeline substantially reduces human annotation time while preserving a human verification step for supervision quality. view at source ↗
Figure 5: RoboTwin 2.0 Benchmark Performance. Success rates across 8 manipulation tasks comparing the π0 baseline, single-head experts, and our full model. While specific heads excel at aligned tasks (e.g., depth head for geometry-heavy Beat Hammer Block), the full model (purple) integrates these capabilities to achieve the best overall average performance (90.63%). … view at source ↗
Figure 6: Real-world Robot Platforms and Evaluation Tasks. (a) ALOHA AgileX dual-arm mobile manipulator with left/right wrist Orbbec Dabai cameras and a third-person Orbbec Dabai camera; we evaluate three household tasks: pick up fruits and vegetables, stack the bowls, clean the tabletop. (b) PSI-Bot equipped with RealMan RM63 arm(s) and DexHand2 Pro hands, with head/chest RealSense D435 cameras; we evaluate three l… view at source ↗
Figure 7: Higher Factor Quality Leads to Better Task Performance. Top: Quantitative analysis on the LIBERO-Plus layout perturbation track shows that improving the quality of each specialized head consistently boosts success rates. (a) Object Head: as the proportion of attention focused on task-relevant object regions increases, success rises from 61.3% to 74.6%, highlighting the importance of precise object-centric … view at source ↗
Figure 8: Visualization of Learned Representations in GuidedVLA. From top to bottom: (i) Object attention focuses on the manipulation target (e.g., pot handle); (ii) Depth features encode explicit 3D structure; (iii) Skill predictions track the temporal progress of task phases. This confirms that each head specializes in its designated semantic factor as intended. … view at source ↗
Figure 9: Comparison of GuidedVLA against Mixture Alternative. Attention head specialization explicitly outperforms learning all objectives in a mixture. … view at source ↗
Figure 10: t-SNE Visualization of Attention Outputs. (a) Specialized attention heads (object: yellow, depth: blue, skill: green) form well-separated clusters, demonstrating factor disentanglement and minimal interference. (b) The mixture alternative shows overlapping clusters (different colors representing different heads), indicating entangled representations. … view at source ↗
Figure 11: ALOHA real-world generalization settings (T1–T3). From left to right: in-domain (positional) perturbations using a 3 × 3 anchor grid, lighting shifts with colored illumination, and scene shifts by adding distractor objects. … view at source ↗
Figure 12: PSI-Bot real-world generalization settings (T4–T6). From left to right: in-domain (positional) perturbations using a 3 × 3 anchor grid, lighting shifts with colored illumination, and scene shifts by adding distractor objects. … view at source ↗
Figure 13: LIBERO-Plus rollout visualization (spatial task suite of LIBERO-Plus). Each column corresponds to one stage in the whole episode, with 7 stages in total. First row shows the original RGB observations during the rollout. Second row visualizes the attention maps from GuidedVLA's object head. Third row presents the depth information encoded by the depth encoder, and fourth row illustrates the corresponding … view at source ↗
Figure 14: LIBERO-Plus rollout visualization (object task suite of LIBERO-Plus). Each column corresponds to one stage in the whole episode, with 7 stages in total. view at source ↗
Figure 15: LIBERO-Plus rollout visualization (goal task suite of LIBERO-Plus). Each column corresponds to one stage in the whole episode, with 7 stages in total. view at source ↗
Figure 16: LIBERO-Plus rollout visualization (long task suite of LIBERO-Plus). Each column corresponds to one stage in the whole episode, with 7 stages in total. view at source ↗
Figure 17: RoboTwin 2.0 rollout visualization (beat hammer block). Each column corresponds to one stage in the whole episode, with 7 stages in total. The first row shows the original RGB observations during the rollout. The second, third, and fourth rows visualize the attention maps from GuidedVLA's object head for the main camera, left wrist camera, and right wrist camera, respectively. The fifth row presents the … view at source ↗
Figure 18: RoboTwin 2.0 rollout visualization (dump bin bigbin). Each column corresponds to one stage in the whole episode, with 7 stages in total. view at source ↗
Figure 19: RoboTwin 2.0 rollout visualization (place burger fries). Each column corresponds to one stage in the whole episode, with 7 stages in total. view at source ↗
Figure 20: RoboTwin 2.0 rollout visualization (place can basket). Each column corresponds to one stage in the whole episode, with 7 stages in total. view at source ↗
Figure 21: Real-robot rollout visualization (ALOHA, T1) under distribution shifts. Rows: in-domain (positional) / lighting / scene (top to bottom). Columns show 7 key stages of a representative successful trajectory. view at source ↗
Figure 22: Real-robot rollout visualization (ALOHA, T2) under distribution shifts. Rows: in-domain (positional) / lighting / scene (top to bottom). Columns show 7 key stages of a representative successful trajectory. view at source ↗
Figure 23: Real-robot rollout visualization (ALOHA, T3) under distribution shifts. Rows: in-domain (positional) / lighting / scene (top to bottom). Columns show 7 key stages of a representative successful trajectory. view at source ↗
Figure 24: Real-robot rollout visualization (PSI-Bot, T4) under distribution shifts. Rows: in-domain (positional) / lighting / scene (top to bottom). Columns show 7 key stages of a representative successful trajectory. view at source ↗
Figure 25: Real-robot rollout visualization (PSI-Bot, T5) under distribution shifts. Rows: in-domain (positional) / lighting / scene (top to bottom). Columns show 7 key stages of a representative successful trajectory. view at source ↗
Figure 26: Real-robot rollout visualization (PSI-Bot, T6) under distribution shifts. Rows: in-domain (positional) / lighting / scene (top to bottom). Columns show 7 key stages of a representative successful trajectory. view at source ↗
Figure 27: Object-head attention on real robots (aligned tasks: T1/T4). For each task, columns show 7 matched key stages of a representative successful rollout (left to right). Top: raw RGB observations. Bottom: normalized attention heatmaps from the object-specialized head overlaid on RGB (warmer colors indicate higher attention). view at source ↗
Figure 28: Depth/geometry-head diagnostics on real robots (aligned tasks: T2/T5). Columns show 7 matched key stages of a representative successful rollout (left to right). Top: RGB observations. Middle: depth predictions (Depth Anything V3, small variant). Bottom: normalized attention heatmaps from the depth/geometry-specialized head (warmer colors indicate higher attention). view at source ↗
Figure 29: Skill/temporal diagnostics on a multi-stage real-robot task. Columns show key stages of the tabletop-cleaning sequence. Top: π0 exhibits incorrect temporal progression (e.g., premature termination or missing required sub-steps; marked with red x). Bottom: GuidedVLA completes the required sub-task order, consistent with skill/temporal supervision. view at source ↗
Figure 30: Representative failure cases of baseline π0 on household manipulation tasks (T1–T3, ALOHA). (a) T1: phantom grasp (top) and grasp offset/slip (bottom) when grasping the small strawberry. (b) T2: half-grasp on nested bowls due to insufficient insertion depth, failing to lift both bowls together. (c) T3: stage-skipping: pouring succeeds but the required tool-return stage is omitted. Examples are under in-dom… view at source ↗
Figure 31: Representative failure cases of baseline π0 on chemical-lab manipulation tasks (T4–T6, PSI-Bot). (a) T4: transparent beaker induces phantom grasp (top) and rim collision during mantle insertion from clearance misestimation (bottom). (b) T5: miss-grasp under lighting/specular highlights (top) and beaker–beaker collision during nesting under clutter (bottom). (c) T6: collision with the ring structure from g… view at source ↗
read the original abstract

Vision-Language-Action (VLA) models aim for general robot learning by aligning action as a modality within powerful Vision-Language Models (VLMs). Existing VLAs rely on end-to-end supervision to implicitly enable the action decoding process to learn task-relevant features. However, without explicit guidance, these models often overfit to spurious correlations, such as visual shortcuts or environmental noise, limiting their generalization. In this paper, we introduce GuidedVLA, a framework designed to manually guide the action generation to focus on task-relevant factors. Our core insight is to treat the action decoder not as a monolithic learner, but as an assembly of functional components. Individual attention heads are supervised by manually defined auxiliary signals to capture distinct factors. As an initial study, we instantiate this paradigm with three specialized heads: object grounding, spatial geometry, and temporal skill logic. Across simulation and real-robot experiments, GuidedVLA improves success rates in both in-domain and out-of-domain settings compared to strong VLA baselines. Finally, we show that the quality of these specialized factors correlates positively with task performance and that our mechanism yields decoupled, high-quality features. Our results suggest that explicitly guiding action-decoder learning is a promising direction for building more robust and general VLA models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces GuidedVLA, a framework for Vision-Language-Action (VLA) models that treats the action decoder as modular components by supervising individual attention heads with manually defined auxiliary signals to capture distinct task-relevant factors (object grounding, spatial geometry, temporal skill logic). It claims this explicit guidance reduces overfitting to spurious correlations and yields improved success rates over strong VLA baselines in both in-domain and out-of-domain settings across simulation and real-robot experiments, with the quality of specialized factors shown to correlate positively with performance and produce decoupled features.

Significance. If the empirical improvements hold under rigorous validation, the work would be significant for robot learning by offering a practical plug-and-play mechanism to inject task-specific inductive biases into large VLAs without full retraining. The modular attention specialization could enhance interpretability and robustness, addressing a key limitation of end-to-end VLA training. The reported correlation between factor quality and task success provides a useful supporting observation for future extensions.

major comments (2)
  1. [§3.2] §3.2 (Specialized Attention Heads): The auxiliary signals are described as manually defined external inputs assigned to specific heads, but no formulation of the auxiliary losses, no equations for how they are integrated into the attention computation, and no mechanism (e.g., masking, routing, or weighting) to prevent interference with the primary action objective are provided. This is load-bearing for the central claim that the heads capture decoupled, task-relevant factors without degrading main-task performance.
  2. [§4] §4 (Experiments): No ablation results isolate the contribution of attention-head specialization from other training changes or from the choice of the three specific factors; the reported success-rate gains cannot be attributed to the proposed mechanism. This undermines the out-of-domain generalization claim.
minor comments (2)
  1. [Abstract] The abstract and introduction would benefit from a brief equation or diagram illustrating the plug-and-play insertion of auxiliary signals into the attention heads.
  2. [§3] Notation for the three specialized heads is introduced informally; consistent symbols and a table summarizing their auxiliary targets would improve clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed review. We appreciate the identification of areas where technical clarity and experimental rigor can be strengthened. We have revised the manuscript to address both major comments by adding the missing formulations and new ablation studies. Our point-by-point responses follow.

read point-by-point responses
  1. Referee: [§3.2] §3.2 (Specialized Attention Heads): The auxiliary signals are described as manually defined external inputs assigned to specific heads, but no formulation of the auxiliary losses, no equations for how they are integrated into the attention computation, and no mechanism (e.g., masking, routing, or weighting) to prevent interference with the primary action objective are provided. This is load-bearing for the central claim that the heads capture decoupled, task-relevant factors without degrading main-task performance.

    Authors: We agree that the original §3.2 description was insufficiently precise on these points. In the revised manuscript we have expanded this section with: (i) explicit formulations of the three auxiliary losses (cross-entropy for object grounding, L2 regression for spatial geometry, and next-token prediction for temporal skill logic); (ii) the integration equation L_total = L_action + λ ∑ L_aux_i with the chosen λ schedule; and (iii) the head-masking procedure that routes each auxiliary signal exclusively to its assigned attention head during the forward pass while leaving the primary action loss unaffected (a schematic sketch of this combined objective follows these responses). These additions directly support the claim of decoupled factors without performance degradation. revision: yes

  2. Referee: [§4] §4 (Experiments): No ablation results isolate the contribution of attention-head specialization from other training changes or from the choice of the three specific factors; the reported success-rate gains cannot be attributed to the proposed mechanism. This undermines the out-of-domain generalization claim.

    Authors: We acknowledge that the original experiments did not include targeted ablations isolating the specialization mechanism. In the revised version we have added two new ablation suites: (1) variants in which individual specialized heads are disabled one at a time, and (2) comparisons using alternative auxiliary factor sets. The updated results show that removing any specialized head measurably reduces both in-domain and out-of-domain success rates, while the full three-head configuration yields the reported gains. These controls allow the performance improvements to be attributed to the attention specialization rather than other training differences. revision: yes
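As referenced in the first response above, here is a minimal sketch of how the stated combined objective L_total = L_action + λ ∑ L_aux_i could be assembled from the three auxiliary losses the rebuttal names. The function signature, the fixed λ value, and the specific loss choices are simplifications for illustration, not the authors' code; the head-masking that routes each term to its assigned head is assumed to happen upstream, by computing each term exclusively from that head's output.

```python
import torch
import torch.nn.functional as F


def total_loss(action_loss: torch.Tensor,
               object_logits: torch.Tensor, object_targets: torch.Tensor,
               depth_pred: torch.Tensor, depth_target: torch.Tensor,
               skill_logits: torch.Tensor, skill_targets: torch.Tensor,
               lam: float = 0.1) -> torch.Tensor:
    """Schematic L_total = L_action + lambda * (L_object + L_depth + L_skill).

    Each auxiliary input is assumed to come only from its assigned attention head,
    so each auxiliary gradient reaches that head's specialized branch and no other.
    The weight lam and its schedule are placeholders.
    """
    l_object = F.cross_entropy(object_logits, object_targets)  # object grounding (cross-entropy)
    l_depth = F.mse_loss(depth_pred, depth_target)             # spatial geometry (L2 regression)
    l_skill = F.cross_entropy(skill_logits, skill_targets)     # temporal skill logic (next-token prediction)
    return action_loss + lam * (l_object + l_depth + l_skill)
```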

Circularity Check

0 steps flagged

No significant circularity; empirical claims rest on external validation

full rationale

The paper presents GuidedVLA as an empirical framework that assigns manually defined auxiliary signals to specific attention heads and reports measured success-rate gains over baselines in simulation and real-robot experiments. No equations, derivations, or first-principles predictions appear that would reduce the reported improvements to quantities defined by the same inputs or by self-citation chains. The auxiliary signals are described as external manual inputs, and performance is assessed via independent test sets, satisfying the criteria for a self-contained, non-circular result.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

The central claim rests on the assumption that three manually chosen auxiliary signals can be attached to distinct attention heads and will produce decoupled, task-relevant features; no free parameters, standard axioms, or new physical entities are explicitly introduced in the abstract.

invented entities (1)
  • specialized attention heads for object grounding, spatial geometry, and temporal skill logic · no independent evidence
    purpose: to capture distinct task-relevant factors inside the action decoder
    These heads are introduced as the core instantiation of the GuidedVLA paradigm.

pith-pipeline@v0.9.0 · 5595 in / 1169 out tokens · 89446 ms · 2026-05-13T03:47:39.182913+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

118 extracted references · 118 canonical work pages · 18 internal anchors

  1. [2]

    Do As I Can, Not As I Say: Grounding Language in Robotic Affordances

    Michael Ahn, Anthony Brohan, Noah Brown, Yevgen Chebotar, Omar Cortes, Byron David, Chelsea Finn, Chuyuan Fu, Keerthana Gopalakrishnan, Karol Haus- man, et al. Do as i can, not as i say: Ground- ing language in robotic affordances.arXiv preprint arXiv:2204.01691, 2022

  2. [3]

    Qwen3-VL Technical Report

    Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, Wenbin Ge, Zhifang Guo, Qidong Huang, Jie Huang, Fei Huang, Binyuan Hui, Shutong Jiang, Zhaohai Li, Mingsheng Li, Mei Li, Kaixin Li, Zicheng Lin, Junyang Lin, Xuejing Liu, Jiawei Liu, Chenglong Liu, Yang Liu, Dayiheng Liu, Shixuan ...

  3. [4]

    PaliGemma: A versatile 3B VLM for transfer

    Lucas Beyer, Andreas Steiner, André Susano Pinto, Alexander Kolesnikov, Xiao Wang, Daniel Salz, Maxim Neumann, Ibrahim Alabdulmohsin, Michael Tschannen, Emanuele Bugliarello, et al. Paligemma: A versatile 3b vlm for transfer.arXiv preprint arXiv:2407.07726, 2024

  4. [5]

    3d cavla: Leveraging depth and 3d context to generalize vision language action models for unseen tasks,

    Vineet Bhat, Yu-Hsiang Lan, Prashanth Krishnamurthy, Ramesh Karri, and Farshad Khorrami. 3d cavla: Leveraging depth and 3d context to generalize vision language action models for unseen tasks.arXiv preprint arXiv:2505.05800, 2025

  5. [6]

    GR00T N1: An Open Foundation Model for Generalist Humanoid Robots

    Johan Bjorck, Fernando Castañeda, Nikita Cherniadev, Xingye Da, Runyu Ding, Linxi Fan, Yu Fang, Dieter Fox, Fengyuan Hu, Spencer Huang, et al. Gr00t n1: An open foundation model for generalist humanoid robots. arXiv preprint arXiv:2503.14734, 2025

  6. [7]

    π0.5: A Vision-Language-Action Model with Open-World Generalization

    Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, Michael Robert Equi, Chelsea Finn, Niccolo Fusai, Manuel Y Galliker, et al.π 0.5: A vision-language-action model with open- world generalization. In9th Annual Conference on Robot Learning, 2025

  7. [8]

    π0: A Vision-Language-Action Flow Model for General Robot Control

    Kevin Black, Noah Brown, Danny Driess, Adnan Es- mail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al.π 0: A vision-language-action flow model for general robot control. InRSS, 2025

  8. [9]

    Real-Time Execution of Action Chunking Flow Policies

    Kevin Black, Manuel Y Galliker, and Sergey Levine. Real-time execution of action chunking flow policies. arXiv preprint arXiv:2506.07339, 2025

  9. [10]

    RT-1: Robotics Transformer for Real-World Control at Scale

    Anthony Brohan, Noah Brown, Justice Carbajal, Yev- gen Chebotar, Joseph Dabis, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, Jasmine Hsu, et al. Rt-1: Robotics transformer for real-world control at scale.arXiv preprint arXiv:2212.06817, 2022

  10. [11]

    UniVLA: Learning to Act Anywhere with Task-centric Latent Actions

    Qingwen Bu, Yanting Yang, Jisong Cai, Shenyuan Gao, Guanghui Ren, Maoqing Yao, Ping Luo, and Hongyang Li. Univla: Learning to act anywhere with task-centric latent actions.arXiv preprint arXiv:2505.06111, 2025

  11. [12]

    WorldVLA: Towards Autoregressive Action World Model

    Jun Cen, Chaohui Yu, Hangjie Yuan, Yuming Jiang, Siteng Huang, Jiayan Guo, Xin Li, Yibing Song, Hao Luo, Fan Wang, Deli Zhao, and Hao Chen. Worldvla: Towards autoregressive action world model.arXiv preprint arXiv:2506.21539, 2025

  12. [13]

    Storm: Slot-based task-aware object-centric representation for robotic manipulation. arXiv preprint arXiv:2601.20381, 2026

    Alexandre Chapin, Emmanuel Dellandréa, and Liming Chen. Storm: Slot-based task-aware object-centric rep- resentation for robotic manipulation.arXiv preprint arXiv:2601.20381, 2026

  13. [14]

    Unified Diffusion VLA: Vision-Language-Action Model via Joint Discrete Denoising Diffusion Process. arXiv preprint arXiv:2511.01718

    Jiayi Chen, Wenxuan Song, Pengxiang Ding, Ziyang Zhou, Han Zhao, Feilong Tang, Donglin Wang, and Haoang Li. Unified diffusion vla: Vision-language- action model via joint discrete denoising diffusion pro- cess.arXiv preprint arXiv:2511.01718, 2025

  14. [15]

    TopReward: Token Probabilities as Hidden Zero-Shot Rewards for Robotics

    Shirui Chen, Cole Harrison, Ying-Chun Lee, Angela Jin Yang, Zhongzheng Ren, Lillian J. Ratliff, Jiafei Duan, Dieter Fox, and Ranjay Krishna. Topreward: Token probabilities as hidden zero-shot rewards for robotics. arXiv preprint arXiv:2602.19313, 2026

  15. [16]

    RoboTwin 2.0: A Scalable Data Generator and Benchmark with Strong Domain Randomization for Robust Bimanual Robotic Manipulation

    Tianxing Chen, Zanxin Chen, Baijun Chen, Zijian Cai, Yibin Liu, Zixuan Li, Qiwei Liang, Xianliang Lin, Yiheng Ge, Zhenyu Gu, Weiliang Deng, Yubin Guo, Tian Nian, Xuanbing Xie, Qiangyu Chen, Kailun Su, Tianling Xu, Guodong Liu, Mengkang Hu, Huan ang Gao, Kaixuan Wang, Zhixuan Liang, Yusen Qin, Xi- aokang Yang, Ping Luo, and Yao Mu. Robotwin 2.0: A scalable...

  16. [17]

    Moe-dp: An moe-enhanced diffusion policy for robust long-horizon robotic manipulation with skill decomposition and failure recovery.arXiv preprint arXiv:2511.05007, 2025

    Baiye Cheng, Tianhai Liang, Suning Huang, Maanping Shao, Feihong Zhang, Botian Xu, Zhengrong Xue, and Huazhe Xu. Moe-dp: An moe-enhanced diffusion policy for robust long-horizon robotic manipulation with skill decomposition and failure recovery.arXiv preprint arXiv:2511.05007, 2025

  17. [18]

    Diffusion policy: Visuomotor policy learning via action diffusion.The International Journal of Robotics Research, 44(10-11):1684–1704, 2025

    Cheng Chi, Zhenjia Xu, Siyuan Feng, Eric Cousineau, Yilun Du, Benjamin Burchfiel, Russ Tedrake, and Shu- ran Song. Diffusion policy: Visuomotor policy learning via action diffusion.The International Journal of Robotics Research, 44(10-11):1684–1704, 2025

  18. [19]

    AimBot: A Simple Auxiliary Visual Cue to Enhance Spatial Awareness of Visuomotor Policies

    Yinpei Dai, Jayjun Lee, Yichi Zhang, Ziqiao Ma, Jed Yang, Amir Zadeh, Chuan Li, Nima Fazeli, and Joyce Chai. Aimbot: A simple auxiliary visual cue to enhance spatial awareness of visuomotor policies.arXiv preprint arXiv:2508.08113, 2025

  19. [20]

    Robonet: Large-scale multi-robot learning,

    Sudeep Dasari, Frederik Ebert, Stephen Tian, Suraj Nair, Bernadette Bucher, Karl Schmeckpeper, Siddharth Singh, Sergey Levine, and Chelsea Finn. Robonet: Large-scale multi-robot learning.arXiv preprint arXiv:1910.11215, 2019

  20. [21]

    Causal confusion in imitation learning.Advances in neural information processing systems, 32, 2019

    Pim De Haan, Dinesh Jayaraman, and Sergey Levine. Causal confusion in imitation learning.Advances in neural information processing systems, 32, 2019

  21. [22]

    Stereovla: Enhancing vision- language-action models with stereo vision.arXiv preprint arXiv:2512.21970, 2025

    Shengliang Deng, Mi Yan, Yixin Zheng, Jiayi Su, Wen- hao Zhang, Xiaoguang Zhao, Heming Cui, Zhizheng Zhang, and He Wang. Stereovla: Enhancing vision- language-action models with stereo vision.arXiv preprint arXiv:2512.21970, 2025

  22. [23]

    Palm-e: An embodied multimodal language model

    Danny Driess, Fei Xia, Mehdi SM Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, et al. Palm-e: An embodied multimodal language model. In International Conference on Machine Learning, pages 8469–8488. PMLR, 2023

  23. [24]

    Bridgedata v2: A dataset for robot learning at scale, 2024

    Frederik Ebert et al. Bridgedata v2: A dataset for robot learning at scale.arXiv preprint arXiv:2308.12952, 2023

  24. [25]

    Interleave-vla: Enhancing robot manipulation with interleaved image- text instructions

    Cunxin Fan, Xiaosong Jia, Yihang Sun, Yixiao Wang, Jianglan Wei, Ziyang Gong, Xiangyu Zhao, Masayoshi Tomizuka, Xue Yang, Junchi Yan, et al. Interleave-vla: Enhancing robot manipulation with interleaved image- text instructions. InICLR, 2026

  25. [26]

    Peafowl: Perception-enhanced multi-view vision-language-action for bimanual manip- ulation.arXiv preprint arXiv:2601.17885, 2026

    Qingyu Fan, Zhaoxiang Li, Yi Lu, Wang Chen, Qiu Shen, Xiao-xiao Long, Yinghao Cai, Tao Lu, Shuo Wang, and Xun Cao. Peafowl: Perception-enhanced multi-view vision-language-action for bimanual manip- ulation.arXiv preprint arXiv:2601.17885, 2026

  26. [27]

    Learning skills from action-free videos

    Hung-Chieh Fang, Kuo-Han Hung, Chu-Rong Chen, Po-Jung Chou, Chun-Kai Yang, Po-Chen Ko, Yu- Chiang Wang, Yueh-Hua Wu, Min-Hung Chen, and Shao-Hua Sun. Learning skills from action-free videos. arXiv preprint arXiv:2512.20052, 2025

  27. [28]

    LIBERO-Plus: In-depth Robustness Analysis of Vision-Language-Action Models

    Senyu Fei, Siyin Wang, Junhao Shi, Zihao Dai, Jikun Cai, Pengfang Qian, Li Ji, Xinzhe He, Shiduo Zhang, Zhaoye Fei, et al. Libero-plus: In-depth robustness anal- ysis of vision-language-action models.arXiv preprint arXiv:2510.13626, 2025

  28. [29]

    Imagenet-trained cnns are biased towards texture; increasing shape bias improves accuracy and robustness

    Robert Geirhos, Patricia Rubisch, Claudio Michaelis, Matthias Bethge, Felix A Wichmann, and Wieland Brendel. Imagenet-trained cnns are biased towards texture; increasing shape bias improves accuracy and robustness. InInternational conference on learning representations, 2018

  29. [30]

    Shortcut learning in deep neural networks.Nature Machine Intelligence, 2 (11):665–673, 2020

    Robert Geirhos, Jörn-Henrik Jacobsen, Claudio Michaelis, Richard Zemel, Wieland Brendel, Matthias Bethge, and Felix A Wichmann. Shortcut learning in deep neural networks.Nature Machine Intelligence, 2 (11):665–673, 2020

  30. [31]

    Octo: An open-source generalist robot policy

    Dibya Ghosh, Homer Rich Walke, Karl Pertsch, Kevin Black, Oier Mees, Sudeep Dasari, Joey Hejna, Tobias Kreiman, Charles Xu, Jianlan Luo, et al. Octo: An open- source generalist robot policy. InRobotics: Science and Systems, 2024

  31. [32]

    Point policy: Unifying observations and actions with key points for robot manipulation. arXiv preprint arXiv:2502.20391, 2025

    Siddhant Haldar and Lerrel Pinto. Point policy: Unify- ing observations and actions with key points for robot manipulation.arXiv preprint arXiv:2502.20391, 2025

  32. [33]

    Spot: Se(3) pose trajectory diffusion for object-centric manipulation

    Cheng-Chun Hsu, Bowen Wen, Jie Xu, Yashraj Narang, Xiaolong Wang, Yuke Zhu, Joydeep Biswas, and Stan Birchfield. Spot: Se(3) pose trajectory diffusion for object-centric manipulation. In2025 IEEE Interna- tional Conference on Robotics and Automation (ICRA), pages 4853–4860, 2025. doi: 10.1109/ICRA55743. 2025.11127562. arXiv:2411.00965

  33. [34]

    Lora: Low-rank adaptation of large language models. ICLR, 1(2):3, 2022

    Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large lan- guage models.ICLR, 1(2):3, 2022

  34. [35]

    Skill-aware diffusion for generalizable robotic manipulation.arXiv preprint arXiv:2601.11266, 2026

    Aoshen Huang, Jiaming Chen, Jiyu Cheng, Ran Song, Wei Pan, and Wei Zhang. Skill-aware diffusion for generalizable robotic manipulation.arXiv preprint arXiv:2601.11266, 2026

  35. [36]

    Rekep: Spatio-temporal reasoning of relational keypoint constraints for robotic manipulation

    Wenlong Huang, Chen Wang, Yunzhu Li, Ruohan Zhang, and Li Fei-Fei. Rekep: Spatio-temporal rea- soning of relational keypoint constraints for robotic manipulation. InConference on Robot Learning, pages 4573–4602. PMLR, 2025

  36. [37]

    PointWorld: Scaling 3D world models for in-the-wild robotic manipulation.arXiv preprint arXiv:2601.03782, 2026

    Wenlong Huang, Yu-Wei Chao, Arsalan Mousavian, Ming-Yu Liu, Dieter Fox, Kaichun Mo, and Li Fei-Fei. Pointworld: Scaling 3d world models for in-the-wild robotic manipulation.arXiv preprint arXiv:2601.03782, 2026

  37. [38]

    Nora: A small open-sourced generalist vision language action model for embodied tasks.arXiv preprint arXiv:2504.19854,

    Chia-Yu Hung, Qi Sun, Pengfei Hong, Amir Zadeh, Chuan Li, U Tan, Navonil Majumder, Soujanya Poria, et al. Nora: A small open-sourced generalist vision language action model for embodied tasks.arXiv preprint arXiv:2504.19854, 2025

  38. [39]

    Rlbench: The robot learning benchmark & learning environment. IEEE Robotics and Automation Letters, 5(2):3019–3026, 2020

    Stephen James, Zicong Ma, David Rovick Arrojo, and Andrew J Davison. Rlbench: The robot learning bench- mark & learning environment.IEEE Robotics and Automation Letters, 5(2):3019–3026, 2020

  39. [40]

    Galaxea open-world dataset and g0 dual-system vla model.arXiv preprint arXiv:2509.00576,

    Tao Jiang, Tianyuan Yuan, Yicheng Liu, Chenhao Lu, Jianning Cui, Xiao Liu, Shuiqi Cheng, Jiyang Gao, Huazhe Xu, and Hang Zhao. Galaxea open-world dataset and g0 dual-system vla model.arXiv preprint arXiv:2509.00576, 2025

  40. [41]

    AsyncVLA: Asynchronous Flow Matching for Vision-Language-Action Models

    Yuhua Jiang, Shuang Cheng, Yan Ding, Feifei Gao, and Biqing Qi. Asyncvla: Asynchronous flow match- ing for vision-language-action models.arXiv preprint arXiv:2511.14148, 2025

  41. [42]

    VIMA : General robot manipulation with multimodal prompts

    Yunfan Jiang, Agrim Gupta, Zichen Zhang, Guanzhi Wang, Yongqiang Dou, Yanjun Chen, Li Fei-Fei, Anima Anandkumar, Yuke Zhu, and Linxi Fan. Vima: General robot manipulation with multimodal prompts.arXiv preprint arXiv:2210.03094, 2(3):6, 2022

  42. [43]

    DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset

    Alexander Khazatsky, Karl Pertsch, Suraj Nair, Ash- win Balakrishna, Sudeep Dasari, Siddharth Karam- cheti, Soroush Nasiriany, Mohan Kumar Srirama, Lawrence Yunliang Chen, Kirsty Ellis, et al. Droid: A large-scale in-the-wild robot manipulation dataset. arXiv preprint arXiv:2403.12945, 2024

  43. [44]

    Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success

    Moo Jin Kim, Chelsea Finn, and Percy Liang. Fine-tuning vision-language-action models: Optimizing speed and success.arXiv preprint arXiv:2502.19645, 2025

  44. [45]

    Openvla: An open-source vision-language-action model

    Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan P Foster, Pannag R Sanketi, Quan Vuong, et al. Openvla: An open-source vision-language- action model. InConference on Robot Learning, pages 2679–2713. PMLR, 2025

  45. [46]

    Tracegen: World modeling in 3d trace space enables learning from cross-embodiment videos. arXiv preprint arXiv:2511.21690, 2025

    Seungjae Lee, Yoonkyo Jung, Inkook Chun, Yao-Chih Lee, Zikui Cai, Hongjia Huang, Aayush Talreja, Tan Dat Dao, Yongyuan Liang, Jia-Bin Huang, et al. Trace- gen: World modeling in 3d trace space enables learn- ing from cross-embodiment videos.arXiv preprint arXiv:2511.21690, 2025

  46. [47]

    Spatial forcing: Implicit spatial representation alignment for vision-language-action model

    Fuhao Li, Wenxuan Song, Han Zhao, Jingbo Wang, Pengxiang Ding, Donglin Wang, Long Zeng, and Haoang Li. Spatial forcing: Implicit spatial repre- sentation alignment for vision-language-action model. InInternational Conference on Learning Represen- tations, 2026. URL https://openreview.net/forum?id= euMVC1DO4k

  47. [48]

    H2r: A human-to-robot data augmentation for robot pre- training from videos.arXiv preprint arXiv:2505.11920, 2025

    Guangrun Li, Yaoxu Lyu, Zhuoyang Liu, Chengkai Hou, Jieyu Zhang, and Shanghang Zhang. H2r: A human-to-robot data augmentation for robot pre- training from videos.arXiv preprint arXiv:2505.11920, 2025

  48. [49]

    Language-guided object-centric diffusion policy for generalizable and collision-aware manipulation

    Hang Li, Qian Feng, Zhi Zheng, Jianxiang Feng, Zhaopeng Chen, and Alois Knoll. Language-guided object-centric diffusion policy for generalizable and collision-aware manipulation. In2025 IEEE Interna- tional Conference on Robotics and Automation (ICRA), pages 12834–12841. IEEE, 2025

  49. [50]

    Coa-vla: Improving vision-language-action models via visual-text chain-of- affordance

    Jinming Li, Yichen Zhu, Zhibin Tang, Junjie Wen, Minjie Zhu, Xiaoyu Liu, Chengmeng Li, Ran Cheng, Yaxin Peng, Yan Peng, et al. Coa-vla: Improving vision-language-action models via visual-text chain-of- affordance. InProceedings of the IEEE/CVF Interna- tional Conference on Computer Vision, pages 9759– 9769, 2025

  50. [51]

    Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models

    Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In International conference on machine learning, pages 19730–19742. PMLR, 2023

  51. [52]

    Bridgevla: Input-output alignment for efficient 3d manipulation learning with vision-language models

    Peiyan Li, Yixiang Chen, Hongtao Wu, Xiao Ma, Xiangnan Wu, Yan Huang, Liang Wang, Tao Kong, and Tieniu Tan. Bridgevla: Input-output alignment for efficient 3d manipulation learning with vision-language models.arXiv preprint arXiv:2506.07961, 2025

  52. [53]

    Posa-vla: Enhancing action generation via pose-conditioned anchor attention.arXiv preprint arXiv:2512.03724, 2025

    Ziwen Li, Xin Wang, Hanlue Zhang, Runnan Chen, Runqi Lin, Xiao He, Han Huang, Yandong Guo, Fakhri Karray, Tongliang Liu, et al. Posa-vla: Enhancing action generation via pose-conditioned anchor attention.arXiv preprint arXiv:2512.03724, 2025

  53. [54]

    Skilldiffuser: Interpretable skill planning for latent diffusion-based manipulation

    Yixing Liang, Anna Xie, Ziyun Feng, Yuke Zhu, Song- Chun Zhu, and Yunzhu Li. Skilldiffuser: Interpretable skill planning for latent diffusion-based manipulation. InProceedings of the IEEE/CVF Conference on Com- puter Vision and Pattern Recognition (CVPR), pages 16467–16476, 2024

  54. [55]

    Discrete diffusion vla: Bringing discrete diffusion to action decoding in vision-language-action policies. arXiv preprint arXiv:2508.20072, 2025

    Zhixuan Liang, Yizhuo Li, Tianshuo Yang, Chengyue Wu, Sitong Mao, Tian Nian, Liuao Pei, Shunbo Zhou, Xiaokang Yang, Jiangmiao Pang, et al. Discrete diffu- sion vla: Bringing discrete diffusion to action decod- ing in vision-language-action policies.arXiv preprint arXiv:2508.20072, 2025

  55. [56]

    ManiCM: Real-Time 3D Diffusion Policy via Consistency Model for Robotic Manipulation

    Fanqi Lin, Haojie Lu, Haojian Fang, and Ping Luo. Manicm: Real-time 3d diffusion policy via consis- tency model for robotic manipulation.arXiv preprint arXiv:2406.01586, 2024

  56. [57]

    Depth Anything 3: Recovering the Visual Space from Any Views

    Haotong Lin, Sili Chen, Jun Hao Liew, Donny Y . Chen, Zhenyu Li, Guang Shi, Jiashi Feng, and Bingyi Kang. Depth anything 3: Recovering the visual space from any views.arXiv preprint arXiv:2511.10647, 2025

  57. [58]

    Constraint-preserving data generation for one-shot visuomotor policy generalization

    Kevin Lin, Varun Ragunath, Andrew McAlinden, Aa- ditya Prasad, Jimmy Wu, Yuke Zhu, and Jeannette Bohg. Constraint-preserving data generation for one- shot visuomotor policy generalization. InProceedings of The 9th Conference on Robot Learning, volume 305 ofProceedings of Machine Learning Research, pages 3631–3646. PMLR, 2025. URL https://proceedings.mlr. ...

  58. [59]

    Libero: Benchmarking knowledge transfer for lifelong robot learning. Advances in Neural Information Processing Systems, 36:44776–44791, 2023

    Bo Liu, Yifeng Zhu, Chongkai Gao, Yihao Feng, Qiang Liu, Yuke Zhu, and Peter Stone. Libero: Benchmarking knowledge transfer for lifelong robot learning.Ad- vances in Neural Information Processing Systems, 36: 44776–44791, 2023

  59. [60]

    Rdt-1b: a diffusion foundation model for bimanual manipulation

    Songming Liu, Lingxuan Wu, Bangguo Li, Hengkai Tan, Huayu Chen, Zhengyi Wang, Ke Xu, Hang Su, and Jun Zhu. Rdt-1b: a diffusion foundation model for bimanual manipulation. InThe Thirteenth International Conference on Learning Representations, 2025

  60. [61]

    Hierarchical diffusion policy for kinematics-aware multi-task robotic manipulation

    Jing Ma, Zhengyi Jiang, Rifat Hoque, Sangwoo Ahn, Pulkit Agrawal, and Kaiming Lee. Hierarchical diffu- sion policy for kinematics-aware multi-task robotic ma- nipulation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 18081–18090, 2024

  61. [62]

    Running VLAs at Real-Time Speed

    Yunchao Ma, Yizhuang Zhou, Yunhuan Yang, Tiancai Wang, and Haoqiang Fan. Running vlas at real-time speed.arXiv preprint arXiv:2510.26742, 2025

  62. [63]

    Roboturk: A crowdsourcing platform for robotic skill learning through imitation

    Ajay Mandlekar, Yuke Zhu, Animesh Garg, Jonathan Booher, Max Spero, Albert Tung, Julian Gao, John Emmons, Anchit Gupta, Emre Orbay, et al. Roboturk: A crowdsourcing platform for robotic skill learning through imitation. InConference on Robot Learning, pages 879–893. PMLR, 2018

  63. [64]

    Calvin: A benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks. IEEE Robotics and Automation Letters, 7(3):7327–7334, 2022

    Oier Mees, Lukas Hermann, Erick Rosete-Beas, and Wolfram Burgard. Calvin: A benchmark for language- conditioned policy learning for long-horizon robot ma- nipulation tasks.IEEE Robotics and Automation Let- ters, 7(3):7327–7334, 2022

  64. [65]

    R3M: A Universal Visual Representation for Robot Manipulation

    Suraj Nair, Aravind Rajeswaran, Vikash Kumar, Chelsea Finn, and Abhinav Gupta. R3m: A universal visual representation for robot manipulation.arXiv preprint arXiv:2203.12601, 2022

  65. [66]

    Vo-dp: Semantic-geometric adaptive diffusion policy for vision-only robotic manipulation. arXiv preprint arXiv:2510.15530, 2025

    Zehao Ni, Yonghao He, Lingfeng Qian, Jilei Mao, Fa Fu, Wei Sui, Hu Su, Junran Peng, Zhipeng Wang, and Bin He. Vo-dp: Semantic-geometric adaptive diffusion policy for vision-only robotic manipulation. arXiv preprint arXiv:2510.15530, 2025

  66. [67]

    Open x-embodiment: Robotic learning datasets and rt-x models: Open x-embodiment collaboration 0

    Abby O’Neill, Abdul Rehman, Abhiram Maddukuri, Abhishek Gupta, Abhishek Padalkar, Abraham Lee, Acorn Pooley, Agrim Gupta, Ajay Mandlekar, Ajinkya Jain, et al. Open x-embodiment: Robotic learning datasets and rt-x models: Open x-embodiment collabo- ration 0. In2024 IEEE International Conference on Robotics and Automation (ICRA), pages 6892–6903. IEEE, 2024

  67. [68]

    Omnimanip: Towards general robotic manipulation via object-centric interaction primitives as spatial constraints

    Mingjie Pan, Jiyao Zhang, Tianshu Wu, Yinghao Zhao, Wenlong Gao, and Hao Dong. Omnimanip: Towards general robotic manipulation via object-centric interac- tion primitives as spatial constraints. InProceedings of the Computer Vision and Pattern Recognition Confer- ence, pages 17359–17369, 2025

  68. [69]

    FAST: Efficient Action Tokenization for Vision-Language-Action Models

    Karl Pertsch, Kyle Stachowicz, Brian Ichter, Danny Driess, Suraj Nair, Quan Vuong, Oier Mees, Chelsea Finn, and Sergey Levine. Fast: Efficient action tokeniza- tion for vision-language-action models.arXiv preprint arXiv:2501.09747, 2025

  69. [70]

    GeoPredict: Leveraging Predictive Kinematics and 3D Gaussian Geometry for Precise VLA Manipulation

    Jingjing Qian, Boyao Han, Chen Shi, Lei Xiao, Long Yang, Shaoshuai Shi, and Li Jiang. Geopredict: Lever- aging predictive kinematics and 3d gaussian geom- etry for precise vla manipulation.arXiv preprint arXiv:2512.16811, 2025

  70. [71]

    Spatialvla: Exploring spatial representations for visual-language-action model

    Delin Qu, Haoming Song, Qizhi Chen, Yuanqi Yao, Xinyi Ye, Yan Ding, Zhigang Wang, JiaYuan Gu, Bin Zhao, Dong Wang, et al. Spatialvla: Exploring spatial representations for visual-language-action model. In Robotics: Science and Systems, 2025

  71. [72]

    SAM 2: Segment anything in images and videos

    Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Rong- hang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, Eric Mintun, Junting Pan, Kalyan Vasudev Alwala, Nicolas Carion, Chao-Yuan Wu, Ross Girshick, Piotr Dollar, and Christoph Feichtenhofer. SAM 2: Segment anything in images and videos. InThe Thirteenth Internati...

  72. [73]

    Grounded sam: Assembling open-world models for di- verse visual tasks, 2024

    Tianhe Ren, Shilong Liu, Ailing Zeng, Jing Lin, Kun- chang Li, He Cao, Jiayu Chen, Xinyu Huang, Yukang Chen, Feng Yan, Zhaoyang Zeng, Hao Zhang, Feng Li, Jie Yang, Hongyang Li, Qing Jiang, and Lei Zhang. Grounded sam: Assembling open-world models for di- verse visual tasks, 2024

  73. [74]

    Clare: Continual learning for vision-language-action models via autonomous adapter routing and expansion.arXiv preprint arXiv:2601.09512, 2026

    Ralf Römer, Yi Zhang, and Angela P Schoellig. Clare: Continual learning for vision-language-action models via autonomous adapter routing and expansion.arXiv preprint arXiv:2601.09512, 2026

  74. [75]

    Expertise need not monopolize: Action-specialized mixture of experts for vision-language-action learning.arXiv preprint arXiv:2510.14300, 2025

    Weijie Shen, Yitian Liu, Yuhao Wu, Zhixuan Liang, Sijia Gu, Dehui Wang, Tian Nian, Lei Xu, Yusen Qin, Jiangmiao Pang, et al. Expertise need not monopolize: Action-specialized mixture of experts for vision-language-action learning.arXiv preprint arXiv:2510.14300, 2025

  75. [76]

    Geovla: Empowering 3d representations in vision-language-action models

    Lin Sun, Bin Xie, Yingfei Liu, Hao Shi, Tiancai Wang, and Jiale Cao. Geovla: Empowering 3d representa- tions in vision-language-action models.arXiv preprint arXiv:2508.09071, 2025

  76. [77]

    Interactive post-training for vision-language- action models, 2025

    Shuhan Tan, Kairan Dou, Yue Zhao, and Philipp Krähenbühl. Interactive post-training for vision-language-action models.arXiv preprint arXiv:2505.17016, 2025

  77. [78]

    Gemini Robotics: Bringing AI into the Physical World

    Gemini Robotics Team, Saminda Abeyruwan, Joshua Ainslie, Jean-Baptiste Alayrac, Montserrat Gonzalez Arenas, Travis Armstrong, Ashwin Balakrishna, Robert Baruch, Maria Bauza, Michiel Blokzijl, et al. Gemini robotics: Bringing ai into the physical world.arXiv preprint arXiv:2503.20020, 2025

  78. [79]

    Llama 2: Open Foundation and Fine-Tuned Chat Models

    Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023

  79. [80]

    Attention is all you need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, 2017

  80. [81]

    Bridgedata v2: A dataset for robot learning at scale

    Homer Rich Walke, Kevin Black, Tony Z Zhao, Quan Vuong, Chongyi Zheng, Philippe Hansen-Estruch, An- dre Wang He, Vivek Myers, Moo Jin Kim, Max Du, et al. Bridgedata v2: A dataset for robot learning at scale. InConference on Robot Learning, pages 1723–

Showing first 80 references.