
arxiv: 2605.12369 · v2 · pith:VSKU2LRXnew · submitted 2026-05-12 · 💻 cs.RO

GuidedVLA: Specifying Task-Relevant Factors via Plug-and-Play Action Attention Specialization

Pith reviewed 2026-06-30 22:09 UTC · model grok-4.3

classification 💻 cs.RO
keywords Vision-Language-Action models · attention specialization · auxiliary supervision · robot manipulation · generalization · task-relevant factors · action decoder

The pith

GuidedVLA supervises individual attention heads in VLA action decoders with auxiliary signals for object grounding, spatial geometry, and temporal skill logic to focus on task-relevant factors.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that standard Vision-Language-Action models rely on end-to-end training to learn task-relevant features implicitly, which often causes them to overfit to spurious correlations like visual shortcuts or noise. GuidedVLA instead decomposes the action decoder into modular attention heads and supplies each with a manually defined auxiliary signal so that one head learns object grounding, another spatial geometry, and another temporal skill logic. This explicit supervision produces features that are more decoupled and aligned with what matters for the task. Experiments in simulation and on physical robots show higher success rates than strong VLA baselines, both inside and outside the training distribution.

Core claim

By treating the action decoder as an assembly of functional components and supervising individual attention heads with auxiliary signals for object grounding, spatial geometry, and temporal skill logic, GuidedVLA makes the model focus on task-relevant factors rather than spurious correlations, which improves success rates in both in-domain and out-of-domain robot tasks and yields a positive correlation between factor quality and performance.

What carries the argument

Plug-and-play action attention specialization, in which separate attention heads receive auxiliary supervision to capture distinct task factors.
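
As a rough illustration of what supervising one attention head with an auxiliary signal could look like, the sketch below penalizes an object head when little of its attention falls on task-relevant patches and adds the weighted auxiliary terms to the action loss. The loss form (negative log attention mass on a mask), the function names `object_head_loss` and `total_loss`, and the weights are assumptions made for illustration; this page does not give the paper's actual loss definitions.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def object_head_loss(attn_logits, object_mask, eps=1e-8):
    """Hypothetical auxiliary loss for a single supervised attention head.

    attn_logits: (num_queries, num_patches) raw attention scores of the head.
    object_mask: (num_patches,) 1.0 for task-relevant patches, 0.0 otherwise.
    Returns (loss, mean_mass): the negative log of the attention mass that
    lands on the mask, averaged over queries, and the mean mass itself.
    """
    attn = softmax(attn_logits, axis=-1)
    mass = (attn * object_mask).sum(axis=-1)  # attention mass on the mask, per query
    loss = -np.log(mass + eps).mean()
    return loss, mass.mean()

def total_loss(action_loss, aux_losses, weights):
    """Action objective plus weighted auxiliary terms (weights are assumed)."""
    return action_loss + sum(w * l for w, l in zip(weights, aux_losses))

# Uniform attention over 4 patches, 2 of which are task-relevant.
logits = np.zeros((3, 4))
mask = np.array([1.0, 1.0, 0.0, 0.0])
loss, mass = object_head_loss(logits, mask)
print(loss, mass)
```

The mean mass returned here is also one plausible reading of "the proportion of attention focused on task-relevant object regions" mentioned in the Figure 7 caption, though the paper may measure it differently.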

If this is right

  • Higher success rates on both in-domain and out-of-domain robot tasks compared with standard VLA baselines.
  • Positive correlation between the quality of the specialized factors and overall task performance.
  • Generation of decoupled, high-quality features from the specialized heads.
  • Explicit guidance of action-decoder learning as a route to more robust and general VLA models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The modular heads could allow inspection of which factor is responsible when a robot fails a task.
  • The same plug-and-play supervision pattern could be applied to other sequential decision-making systems that use vision-language backbones.
  • If the auxiliary signals prove cheap to specify, the approach might scale to additional factors without retraining the entire model from scratch.

Load-bearing premise

Manually defined auxiliary signals for object grounding, spatial geometry, and temporal skill logic can be supplied during training without new labeling costs or interference with the main action objective, and these three factors cover the needed task information.

What would settle it

An experiment in which adding the three specialized heads produces no measurable gain in success rate over an otherwise identical end-to-end VLA baseline on out-of-domain tasks, or in which measured factor quality shows no correlation with task success.
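
One way to run the second check is to compute a correlation coefficient between a per-run factor-quality score and the corresponding success rate. The sketch below is a plain Pearson correlation; the helper name `pearson_r` is hypothetical, and the numbers in the usage lines are synthetic placeholders, not values from the paper.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation between two equal-length 1-D sequences."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if x.shape != y.shape or x.ndim != 1:
        raise ValueError("x and y must be 1-D and of equal length")
    xc = x - x.mean()
    yc = y - y.mean()
    denom = np.sqrt((xc ** 2).sum() * (yc ** 2).sum())
    if denom == 0:
        raise ValueError("correlation is undefined for constant input")
    return float((xc * yc).sum() / denom)

# Synthetic placeholder values (NOT results from the paper).
factor_quality = [0, 1, 2, 3]
success_rate = [1, 3, 5, 7]
print(pearson_r(factor_quality, success_rate))
```

A coefficient near zero across checkpoints or perturbation levels would count against the claimed link between factor quality and task success; a few data points alone would not settle it either way.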

Figures

Figures reproduced from arXiv: 2605.12369 by Bowen Yang, Chao Jing, Chao Wu, Chenhe Zhang, Cunxin Fan, Haidong Cao, Hongyang Li, Junchi Yan, Qifeng Li, Qingwen Bu, Xian Nie, Xiaosong Jia, Yilin Chai, Yuchen Zhou, Yufeng Li, Yu-Gang Jiang, Zhenjie Yang, Zijian Liang, Zuhao Ge, Zuxuan Wu.

Figure 1
Figure 1: We present GuidedVLA, a VLA paradigm in which the action decoder is explicitly guided to capture task-relevant information such as object grounding, spatial geometry, and temporal skill logic. Across simulation and real-robot experiments, GuidedVLA significantly improves success rates in both in-domain and out-of-domain settings, demonstrating the effectiveness of specifying action-decoder attention heads …
Figure 2
Figure 2: Architecture of GuidedVLA. We introduce explicit, structured guidance into the multi-head attention layers of the VLA action decoder. Instead of relying on implicitly entangled representations, we repurpose dedicated attention heads to specialize in distinct task-relevant factors: (i) Object Head supervises its attention maps to explicitly ground task-relevant objects and suppress distractors via L_object; …
Figure 3
Figure 3: ControlNet-style residual adapter for plug-and-play head specialization. The pretrained main attention branch is kept as the behavior-preserving path, while a factor-specific attention branch is fused through a zero-initialized projection. The adapter copies weights from the base policy and gradually injects task-relevant biases during training.
Figure 4
Figure 4: Automatic factor annotation pipeline. Object masks are initialized by Qwen3-VL point prompts and propagated by SAM2, skill labels are generated by Qwen3-VL from stage descriptions and a predefined skill list, and depth guidance uses frozen depth features without requiring depth labels. The pipeline substantially reduces human annotation time while preserving a human verification step for supervision qualit…
Figure 5
Figure 5: RoboTwin 2.0 Benchmark Performance. Success rates across 8 manipulation tasks comparing the π0 baseline, single-head experts, and our full model. While specific heads excel at aligned tasks (e.g., depth head for geometry-heavy Beat Hammer Block), the full model (purple) integrates these capabilities to achieve the best overall average performance (90.63%).
Figure 6
Figure 6: Real-world Robot Platforms and Evaluation Tasks. (a) ALOHA AgileX dual-arm mobile manipulator with left/right wrist Orbbec Dabai cameras and a third-person Orbbec Dabai camera; we evaluate three household tasks: pick up fruits and vegetables, stack the bowls, clean the tabletop. (b) PSI-Bot equipped with RealMan RM63 arm(s) and DexHand2 Pro hands, with head/chest RealSense D435 cameras; we evaluate three l…
Figure 7
Figure 7: Higher Factor Quality Leads to Better Task Performance. Top: Quantitative analysis on the LIBERO-Plus layout perturbation track shows that improving the quality of each specialized head consistently boosts success rates. (a) Object Head: as the proportion of attention focused on task-relevant object regions increases, success rises from 61.3% to 74.6%, highlighting the importance of precise object-centric …
Figure 8
Figure 8: Visualization of Learned Representations in GuidedVLA. From top to bottom: (i) Object attention focuses on the manipulation target (e.g., pot handle); (ii) Depth features encode explicit 3D structure; (iii) Skill predictions track the temporal progress of task phases. This confirms that each head specializes in its designated semantic factor as intended.
Figure 9
Figure 9: Comparison of GuidedVLA against Mixture Alternative. Attention head specialization explicitly outperforms learning all objectives in a mixture.
Figure 10
Figure 10: t-SNE Visualization of Attention Outputs. (a) Specialized attention heads (object: yellow, depth: blue, skill: green) form well-separated clusters, demonstrating factor disentanglement and minimal interference. (b) The mixture alternative shows overlapping clusters (different colors representing different heads), indicating entangled representations.
Figure 11
Figure 11: ALOHA real-world generalization settings (T1–T3). From left to right: in-domain (positional) perturbations using a 3 × 3 anchor grid, lighting shifts with colored illumination, and scene shifts by adding distractor objects.
Figure 12
Figure 12: PSI-Bot real-world generalization settings (T4–T6). From left to right: in-domain (positional) perturbations using a 3 × 3 anchor grid, lighting shifts with colored illumination, and scene shifts by adding distractor objects.
Figure 13
Figure 13: LIBERO-Plus rollout visualization (spatial task suite of LIBERO-Plus). Each column corresponds to one stage in the whole episode, with 7 stages in total. First row shows the original RGB observations during the rollout. Second row visualizes the attention maps from GuidedVLA's object head. Third row presents the depth information encoded by the depth encoder, and fourth row illustrates the corresponding …
Figure 14
Figure 14: LIBERO-Plus rollout visualization (object task suite of LIBERO-Plus). Each column corresponds to one stage in the whole episode, with 7 stages in total.
Figure 15
Figure 15: LIBERO-Plus rollout visualization (goal task suite of LIBERO-Plus). Each column corresponds to one stage in the whole episode, with 7 stages in total.
Figure 16
Figure 16: LIBERO-Plus rollout visualization (long task suite of LIBERO-Plus). Each column corresponds to one stage in the whole episode, with 7 stages in total.
Figure 17
Figure 17: RoboTwin 2.0 rollout visualization (beat hammer block). Each column corresponds to one stage in the whole episode, with 7 stages in total. The first row shows the original RGB observations during the rollout. The second, third, and fourth rows visualize the attention maps from GuidedVLA's object head for the main camera, left wrist camera, and right wrist camera, respectively. The fifth row presents the …
Figure 18
Figure 18: RoboTwin 2.0 rollout visualization (dump bin bigbin). Each column corresponds to one stage in the whole episode, with 7 stages in total.
Figure 19
Figure 19: RoboTwin 2.0 rollout visualization (place burger fries). Each column corresponds to one stage in the whole episode, with 7 stages in total.
Figure 20
Figure 20: RoboTwin 2.0 rollout visualization (place can basket). Each column corresponds to one stage in the whole episode, with 7 stages in total.
Figure 21
Figure 21: Real-robot rollout visualization (ALOHA, T1) under distribution shifts. Rows: in-domain (positional) / lighting / scene (top to bottom). Columns show 7 key stages of a representative successful trajectory.
Figure 22
Figure 22: Real-robot rollout visualization (ALOHA, T2) under distribution shifts. Rows: in-domain (positional) / lighting / scene (top to bottom). Columns show 7 key stages of a representative successful trajectory.
Figure 23
Figure 23: Real-robot rollout visualization (ALOHA, T3) under distribution shifts. Rows: in-domain (positional) / lighting / scene (top to bottom). Columns show 7 key stages of a representative successful trajectory.
Figure 24
Figure 24: Real-robot rollout visualization (PSI-Bot, T4) under distribution shifts. Rows: in-domain (positional) / lighting / scene (top to bottom). Columns show 7 key stages of a representative successful trajectory.
Figure 25
Figure 25: Real-robot rollout visualization (PSI-Bot, T5) under distribution shifts. Rows: in-domain (positional) / lighting / scene (top to bottom). Columns show 7 key stages of a representative successful trajectory.
Figure 26
Figure 26: Real-robot rollout visualization (PSI-Bot, T6) under distribution shifts. Rows: in-domain (positional) / lighting / scene (top to bottom). Columns show 7 key stages of a representative successful trajectory.
Figure 27
Figure 27: Object-head attention on real robots (aligned tasks: T1/T4). For each task, columns show 7 matched key stages of a representative successful rollout (left to right). Top: raw RGB observations. Bottom: normalized attention heatmaps from the object-specialized head overlaid on RGB (warmer colors indicate higher attention).
Figure 28
Figure 28: Depth/geometry-head diagnostics on real robots (aligned tasks: T2/T5). Columns show 7 matched key stages of a representative successful rollout (left to right). Top: RGB observations. Middle: depth predictions (Depth Anything V3, small variant). Bottom: normalized attention heatmaps from the depth/geometry-specialized head (warmer colors indicate higher attention).
Figure 29
Figure 29: Skill/temporal diagnostics on a multi-stage real-robot task. Columns show key stages of the tabletop-cleaning sequence. Top: π0 exhibits incorrect temporal progression (e.g., premature termination or missing required sub-steps; marked with red x). Bottom: GuidedVLA completes the required sub-task order, consistent with skill/temporal supervision.
Figure 30
Figure 30: Representative failure cases of baseline π0 on household manipulation tasks (T1–T3, ALOHA). (a) T1: phantom grasp (top) and grasp offset/slip (bottom) when grasping the small strawberry. (b) T2: half-grasp on nested bowls due to insufficient insertion depth, failing to lift both bowls together. (c) T3: stage-skipping: pouring succeeds but the required tool-return stage is omitted. Examples are under in-dom…
Figure 31
Figure 31: Representative failure cases of baseline π0 on chemical-lab manipulation tasks (T4–T6, PSI-Bot). (a) T4: transparent beaker induces phantom grasp (top) and rim collision during mantle insertion from clearance misestimation (bottom). (b) T5: miss-grasp under lighting/specular highlights (top) and beaker–beaker collision during nesting under clutter (bottom). (c) T6: collision with the ring structure from g…
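
The Figure 3 caption describes a pretrained main branch kept as the behavior-preserving path, with a factor-specific branch fused through a zero-initialized projection. A minimal sketch of that idea follows, under strong simplifying assumptions: plain linear maps stand in for the attention branches, and the class name `ZeroInitResidualAdapter` and its attributes are hypothetical rather than taken from the paper's code.

```python
import numpy as np

class ZeroInitResidualAdapter:
    """Toy residual adapter: frozen main path plus a factor branch
    fused through a zero-initialized projection (illustrative only)."""

    def __init__(self, base_weight):
        # Frozen pretrained projection, shape (d_in, d_out).
        self.base_weight = base_weight
        # Factor branch starts as a copy of the pretrained weights (trainable).
        self.factor_weight = base_weight.copy()
        # Zero-initialized fusion projection, shape (d_out, d_out).
        d_out = base_weight.shape[1]
        self.zero_proj = np.zeros((d_out, d_out))

    def forward(self, x):
        main = x @ self.base_weight       # behavior-preserving path
        factor = x @ self.factor_weight   # factor-specific path
        return main + factor @ self.zero_proj

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))
x = rng.normal(size=(2, 4))
adapter = ZeroInitResidualAdapter(W)
# At initialization the adapter reproduces the pretrained output exactly.
print(np.allclose(adapter.forward(x), x @ W))
```

Because the fusion projection starts at zero, the factor branch contributes nothing until training moves it away from zero, which is the usual motivation for this kind of initialization when adapting a pretrained policy.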
Original abstract

Vision-Language-Action (VLA) models aim for general robot learning by aligning action as a modality within powerful Vision-Language Models (VLMs). Existing VLAs rely on end-to-end supervision to implicitly enable the action decoding process to learn task-relevant features. However, without explicit guidance, these models often overfit to spurious correlations, such as visual shortcuts or environmental noise, limiting their generalization. In this paper, we introduce GuidedVLA, a framework designed to manually guide the action generation to focus on task-relevant factors. Our core insight is to treat the action decoder not as a monolithic learner, but as an assembly of functional components. Individual attention heads are supervised by manually defined auxiliary signals to capture distinct factors. As an initial study, we instantiate this paradigm with three specialized heads: object grounding, spatial geometry, and temporal skill logic. Across simulation and real-robot experiments, GuidedVLA improves success rates in both in-domain and out-of-domain settings compared to strong VLA baselines. Finally, we show that the quality of these specialized factors correlates positively with task performance and that our mechanism yields decoupled, high-quality features. Our results suggest that explicitly guiding action-decoder learning is a promising direction for building more robust and general VLA models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces GuidedVLA, which treats the action decoder in Vision-Language-Action (VLA) models as an assembly of specialized attention heads. Individual heads are supervised during training by manually defined auxiliary signals targeting object grounding, spatial geometry, and temporal skill logic. The goal is to explicitly guide feature learning toward task-relevant factors, reducing overfitting to spurious correlations that occur in standard end-to-end VLA training. Experiments in simulation and on real robots are reported to show higher success rates than strong VLA baselines in both in-domain and out-of-domain settings, with an additional finding that factor quality correlates positively with task performance.

Significance. If the empirical claims hold after quantification, the plug-and-play specialization of attention heads offers a modular inductive bias for VLA models that could improve robustness without a full architectural overhaul. The positive correlation result, if supported by ablations, would provide evidence that decoupled features aid generalization. The approach aligns with trends in interpretable transformer components but still needs to demonstrate that the auxiliary signals impose no net increase in supervision cost.

major comments (2)
  1. [Abstract] Abstract: the claim that GuidedVLA 'improves success rates in both in-domain and out-of-domain settings compared to strong VLA baselines' supplies no numerical values, baseline names, statistical tests, or ablation results. This absence is load-bearing for the central claim that explicit head supervision yields better generalization than end-to-end training.
  2. [Method] Method description of auxiliary signals: the paper states that signals are 'manually defined' for the three factors but does not specify their source (e.g., ground-truth annotations, simulator state, or additional human labeling). Without this, it is impossible to evaluate the claim that the signals can be supplied at training time without new labeling costs or negative interference with the primary action objective.
minor comments (1)
  1. [Abstract] The abstract refers to 'strong VLA baselines' without naming them; adding the specific model names and training details would improve clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript accordingly to improve clarity and support for the claims.

Point-by-point responses
  1. Referee: [Abstract] Abstract: the claim that GuidedVLA 'improves success rates in both in-domain and out-of-domain settings compared to strong VLA baselines' supplies no numerical values, baseline names, statistical tests, or ablation results. This absence is load-bearing for the central claim that explicit head supervision yields better generalization than end-to-end training.

    Authors: We agree that the abstract lacks the quantitative details needed to substantiate the central claims. In the revised manuscript, we will update the abstract to include specific numerical success rates from the simulation and real-robot experiments, the names of the strong VLA baselines, and references to the ablation studies and factor-quality correlation results.

    Revision: yes

  2. Referee: [Method] Method description of auxiliary signals: the paper states that signals are 'manually defined' for the three factors but does not specify their source (e.g., ground-truth annotations, simulator state, or additional human labeling). Without this, it is impossible to evaluate the claim that the signals can be supplied at training time without new labeling costs or negative interference with the primary action objective.

    Authors: We acknowledge the ambiguity in the current description. The method section will be revised to explicitly detail the source of each auxiliary signal (object grounding, spatial geometry, and temporal skill logic), including whether they derive from simulator state, existing annotations, or perception pipelines, and to confirm that no additional labeling costs or interference with the primary objective are introduced.

    Revision: yes

Circularity Check

0 steps flagged

No circularity: method relies on external manual auxiliary signals with no self-referential derivations or fitted predictions.

full rationale

The paper presents a framework for supervising attention heads with manually defined auxiliary signals for object grounding, spatial geometry, and temporal skill logic. No equations, derivations, or first-principles predictions are present in the abstract or described method. The auxiliary signals are explicitly external inputs supplied at training time rather than outputs derived from the model itself or from self-citations. No load-bearing claims reduce to self-definition, renaming of known results, or fitted parameters presented as predictions. The central claim of improved generalization rests on empirical experiments comparing to baselines, which are independent of any internal circular construction. This is a standard non-circular proposal of an architectural modification with external supervision.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies no information on free parameters, background axioms, or new postulated entities; all such elements remain unidentified.

pith-pipeline@v0.9.1-grok · 5826 in / 1018 out tokens · 22721 ms · 2026-06-30T22:09:14.456662+00:00 · methodology


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. ELAN4D: Embodiment-Centric 4D Supervision for Vision-Language-Action Models via Plug-and-Play Adaptation

    cs.RO 2026-05 unverdicted novelty 6.0

    ELAN4D introduces plug-and-play 4D keypoint track supervision from forward kinematics to enhance VLA policy generalization in robotic manipulation tasks.

Reference graph

Works this paper leans on

118 extracted references · 66 canonical work pages · cited by 1 Pith paper · 31 internal anchors

  1. [2]

    Do As I Can, Not As I Say: Grounding Language in Robotic Affordances

    Michael Ahn, Anthony Brohan, Noah Brown, Yevgen Chebotar, Omar Cortes, Byron David, Chelsea Finn, Chuyuan Fu, Keerthana Gopalakrishnan, Karol Hausman, et al. Do as I can, not as I say: Grounding language in robotic affordances. arXiv preprint arXiv:2204.01691, 2022

  2. [3]

    Qwen3-VL Technical Report

    Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, Wenbin Ge, Zhifang Guo, Qidong Huang, Jie Huang, Fei Huang, Binyuan Hui, Shutong Jiang, Zhaohai Li, Mingsheng Li, Mei Li, Kaixin Li, Zicheng Lin, Junyang Lin, Xuejing Liu, Jiawei Liu, Chenglong Liu, Yang Liu, Dayiheng Liu, Shixuan ...

  3. [4]

    PaliGemma: A versatile 3B VLM for transfer

    Lucas Beyer, Andreas Steiner, André Susano Pinto, Alexander Kolesnikov, Xiao Wang, Daniel Salz, Maxim Neumann, Ibrahim Alabdulmohsin, Michael Tschannen, Emanuele Bugliarello, et al. PaliGemma: A versatile 3B VLM for transfer. arXiv preprint arXiv:2407.07726, 2024

  4. [5]

    3D CAVLA: Leveraging depth and 3D context to generalize vision language action models for unseen tasks

    Vineet Bhat, Yu-Hsiang Lan, Prashanth Krishnamurthy, Ramesh Karri, and Farshad Khorrami. 3D CAVLA: Leveraging depth and 3D context to generalize vision language action models for unseen tasks. arXiv preprint arXiv:2505.05800, 2025

  5. [6]

    GR00T N1: An Open Foundation Model for Generalist Humanoid Robots

    Johan Bjorck, Fernando Castañeda, Nikita Cherniadev, Xingye Da, Runyu Ding, Linxi Fan, Yu Fang, Dieter Fox, Fengyuan Hu, Spencer Huang, et al. GR00T N1: An open foundation model for generalist humanoid robots. arXiv preprint arXiv:2503.14734, 2025

  6. [7]

    π0.5: A vision-language-action model with open-world generalization

    Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, Michael Robert Equi, Chelsea Finn, Niccolo Fusai, Manuel Y Galliker, et al. π0.5: A vision-language-action model with open-world generalization. In 9th Annual Conference on Robot Learning, 2025

  7. [8]

    π0: A vision-language-action flow model for general robot control

    Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al. π0: A vision-language-action flow model for general robot control. In RSS, 2025

  8. [9]

    Real-Time Execution of Action Chunking Flow Policies

    Kevin Black, Manuel Y Galliker, and Sergey Levine. Real-time execution of action chunking flow policies. arXiv preprint arXiv:2506.07339, 2025

  9. [10]

    RT-1: Robotics Transformer for Real-World Control at Scale

    Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Joseph Dabis, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, Jasmine Hsu, et al. RT-1: Robotics transformer for real-world control at scale. arXiv preprint arXiv:2212.06817, 2022

  10. [11]

    UniVLA: Learning to Act Anywhere with Task-centric Latent Actions

    Qingwen Bu, Yanting Yang, Jisong Cai, Shenyuan Gao, Guanghui Ren, Maoqing Yao, Ping Luo, and Hongyang Li. UniVLA: Learning to act anywhere with task-centric latent actions. arXiv preprint arXiv:2505.06111, 2025

  11. [12]

    WorldVLA: Towards Autoregressive Action World Model

    Jun Cen, Chaohui Yu, Hangjie Yuan, Yuming Jiang, Siteng Huang, Jiayan Guo, Xin Li, Yibing Song, Hao Luo, Fan Wang, Deli Zhao, and Hao Chen. WorldVLA: Towards autoregressive action world model. arXiv preprint arXiv:2506.21539, 2025

  12. [13]

    STORM: Slot-based Task-aware Object-centric Representation for robotic Manipulation

    Alexandre Chapin, Emmanuel Dellandréa, and Liming Chen. STORM: Slot-based task-aware object-centric representation for robotic manipulation. arXiv preprint arXiv:2601.20381, 2026

  13. [14]

    Unified Diffusion VLA: Vision-language-action model via joint discrete denoising diffusion process

    Jiayi Chen, Wenxuan Song, Pengxiang Ding, Ziyang Zhou, Han Zhao, Feilong Tang, Donglin Wang, and Haoang Li. Unified Diffusion VLA: Vision-language-action model via joint discrete denoising diffusion process. arXiv preprint arXiv:2511.01718, 2025

  14. [15]

    Topreward: Token probabilities as hidden zero-shot rewards for robotics

    Shirui Chen, Cole Harrison, Ying-Chun Lee, Angela Jin Yang, Zhongzheng Ren, Lillian J. Ratliff, Jiafei Duan, Dieter Fox, and Ranjay Krishna. Topreward: Token probabilities as hidden zero-shot rewards for robotics. arXiv preprint arXiv:2602.19313, 2026

  15. [16]

    RoboTwin 2.0: A Scalable Data Generator and Benchmark with Strong Domain Randomization for Robust Bimanual Robotic Manipulation

    Tianxing Chen, Zanxin Chen, Baijun Chen, Zijian Cai, Yibin Liu, Zixuan Li, Qiwei Liang, Xianliang Lin, Yiheng Ge, Zhenyu Gu, Weiliang Deng, Yubin Guo, Tian Nian, Xuanbing Xie, Qiangyu Chen, Kailun Su, Tianling Xu, Guodong Liu, Mengkang Hu, Huan-ang Gao, Kaixuan Wang, Zhixuan Liang, Yusen Qin, Xiaokang Yang, Ping Luo, and Yao Mu. Robotwin 2.0: A scalable...

  16. [17]

    MoE-DP: An MoE-enhanced diffusion policy for robust long-horizon robotic manipulation with skill decomposition and failure recovery

    Baiye Cheng, Tianhai Liang, Suning Huang, Maanping Shao, Feihong Zhang, Botian Xu, Zhengrong Xue, and Huazhe Xu. MoE-DP: An MoE-enhanced diffusion policy for robust long-horizon robotic manipulation with skill decomposition and failure recovery. arXiv preprint arXiv:2511.05007, 2025

  17. [18]

    Diffusion policy: Visuomotor policy learning via action diffusion

    Cheng Chi, Zhenjia Xu, Siyuan Feng, Eric Cousineau, Yilun Du, Benjamin Burchfiel, Russ Tedrake, and Shuran Song. Diffusion policy: Visuomotor policy learning via action diffusion. The International Journal of Robotics Research, 44(10-11):1684–1704, 2025

  18. [19]

    Aimbot: A simple auxiliary visual cue to enhance spatial awareness of visuomotor policies

    Yinpei Dai, Jayjun Lee, Yichi Zhang, Ziqiao Ma, Jed Yang, Amir Zadeh, Chuan Li, Nima Fazeli, and Joyce Chai. Aimbot: A simple auxiliary visual cue to enhance spatial awareness of visuomotor policies. arXiv preprint arXiv:2508.08113, 2025

  19. [20]

    RoboNet: Large-Scale Multi-Robot Learning

    Sudeep Dasari, Frederik Ebert, Stephen Tian, Suraj Nair, Bernadette Bucher, Karl Schmeckpeper, Siddharth Singh, Sergey Levine, and Chelsea Finn. RoboNet: Large-scale multi-robot learning. arXiv preprint arXiv:1910.11215, 2019

  20. [21]

    Causal confusion in imitation learning

    Pim De Haan, Dinesh Jayaraman, and Sergey Levine. Causal confusion in imitation learning. Advances in Neural Information Processing Systems, 32, 2019

  21. [22]

    StereoVLA: Enhancing Vision-Language-Action Models with Stereo Vision

    Shengliang Deng, Mi Yan, Yixin Zheng, Jiayi Su, Wenhao Zhang, Xiaoguang Zhao, Heming Cui, Zhizheng Zhang, and He Wang. StereoVLA: Enhancing vision-language-action models with stereo vision. arXiv preprint arXiv:2512.21970, 2025

  22. [23]

    PaLM-E: An embodied multimodal language model

    Danny Driess, Fei Xia, Mehdi SM Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, et al. PaLM-E: An embodied multimodal language model. In International Conference on Machine Learning, pages 8469–8488. PMLR, 2023

  23. [24]

    BridgeData V2: A dataset for robot learning at scale

    Frederik Ebert et al. BridgeData V2: A dataset for robot learning at scale. arXiv preprint arXiv:2308.12952, 2023

  24. [25]

    Interleave-VLA: Enhancing robot manipulation with interleaved image-text instructions

    Cunxin Fan, Xiaosong Jia, Yihang Sun, Yixiao Wang, Jianglan Wei, Ziyang Gong, Xiangyu Zhao, Masayoshi Tomizuka, Xue Yang, Junchi Yan, et al. Interleave-VLA: Enhancing robot manipulation with interleaved image-text instructions. In ICLR, 2026

  25. [26]

    Peafowl: Perception-enhanced multi-view vision-language-action for bimanual manipulation

    Qingyu Fan, Zhaoxiang Li, Yi Lu, Wang Chen, Qiu Shen, Xiao-xiao Long, Yinghao Cai, Tao Lu, Shuo Wang, and Xun Cao. Peafowl: Perception-enhanced multi-view vision-language-action for bimanual manipulation. arXiv preprint arXiv:2601.17885, 2026

  26. [27]

    Learning skills from action-free videos

    Hung-Chieh Fang, Kuo-Han Hung, Chu-Rong Chen, Po-Jung Chou, Chun-Kai Yang, Po-Chen Ko, Yu-Chiang Wang, Yueh-Hua Wu, Min-Hung Chen, and Shao-Hua Sun. Learning skills from action-free videos. arXiv preprint arXiv:2512.20052, 2025

  27. [28]

    LIBERO-Plus: In-depth Robustness Analysis of Vision-Language-Action Models

    Senyu Fei, Siyin Wang, Junhao Shi, Zihao Dai, Jikun Cai, Pengfang Qian, Li Ji, Xinzhe He, Shiduo Zhang, Zhaoye Fei, et al. Libero-plus: In-depth robustness analysis of vision-language-action models. arXiv preprint arXiv:2510.13626, 2025

  28. [29]

    Imagenet-trained cnns are biased towards texture; increasing shape bias improves accuracy and robustness

    Robert Geirhos, Patricia Rubisch, Claudio Michaelis, Matthias Bethge, Felix A Wichmann, and Wieland Brendel. Imagenet-trained cnns are biased towards texture; increasing shape bias improves accuracy and robustness. In International conference on learning representations, 2018

  29. [30]

    Shortcut learning in deep neural networks

    Robert Geirhos, Jörn-Henrik Jacobsen, Claudio Michaelis, Richard Zemel, Wieland Brendel, Matthias Bethge, and Felix A Wichmann. Shortcut learning in deep neural networks. Nature Machine Intelligence, 2(11):665–673, 2020

  30. [31]

    Octo: An open-source generalist robot policy

    Dibya Ghosh, Homer Rich Walke, Karl Pertsch, Kevin Black, Oier Mees, Sudeep Dasari, Joey Hejna, Tobias Kreiman, Charles Xu, Jianlan Luo, et al. Octo: An open-source generalist robot policy. In Robotics: Science and Systems, 2024

  31. [32]

    Point policy: Unifying observations and actions with key points for robot manipulation

    Siddhant Haldar and Lerrel Pinto. Point policy: Unifying observations and actions with key points for robot manipulation. arXiv preprint arXiv:2502.20391, 2025

  32. [33]

    Spot: Se(3) pose trajectory diffusion for object-centric manipulation

    Cheng-Chun Hsu, Bowen Wen, Jie Xu, Yashraj Narang, Xiaolong Wang, Yuke Zhu, Joydeep Biswas, and Stan Birchfield. Spot: Se(3) pose trajectory diffusion for object-centric manipulation. In 2025 IEEE International Conference on Robotics and Automation (ICRA), pages 4853–4860, 2025. doi: 10.1109/ICRA55743.2025.11127562. arXiv:2411.00965

  33. [34]

    Lora: Low-rank adaptation of large language models

    Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models. ICLR, 1(2):3, 2022

  34. [35]

    Skill-aware diffusion for generalizable robotic manipulation

    Aoshen Huang, Jiaming Chen, Jiyu Cheng, Ran Song, Wei Pan, and Wei Zhang. Skill-aware diffusion for generalizable robotic manipulation. arXiv preprint arXiv:2601.11266, 2026

  35. [36]

    Rekep: Spatio-temporal reasoning of relational keypoint constraints for robotic manipulation

    Wenlong Huang, Chen Wang, Yunzhu Li, Ruohan Zhang, and Li Fei-Fei. Rekep: Spatio-temporal reasoning of relational keypoint constraints for robotic manipulation. In Conference on Robot Learning, pages 4573–4602. PMLR, 2025

  36. [37]

    Pointworld: Scaling 3d world models for in-the-wild robotic manipulation

    Wenlong Huang, Yu-Wei Chao, Arsalan Mousavian, Ming-Yu Liu, Dieter Fox, Kaichun Mo, and Li Fei-Fei. Pointworld: Scaling 3d world models for in-the-wild robotic manipulation. arXiv preprint arXiv:2601.03782, 2026

  37. [38]

    NORA: A Small Open-Sourced Generalist Vision Language Action Model for Embodied Tasks

    Chia-Yu Hung, Qi Sun, Pengfei Hong, Amir Zadeh, Chuan Li, U Tan, Navonil Majumder, Soujanya Poria, et al. Nora: A small open-sourced generalist vision language action model for embodied tasks. arXiv preprint arXiv:2504.19854, 2025

  38. [39]

    Rlbench: The robot learning benchmark & learning environment

    Stephen James, Zicong Ma, David Rovick Arrojo, and Andrew J Davison. Rlbench: The robot learning benchmark & learning environment. IEEE Robotics and Automation Letters, 5(2):3019–3026, 2020

  39. [40]

    Galaxea open-world dataset and g0 dual-system vla model

    Tao Jiang, Tianyuan Yuan, Yicheng Liu, Chenhao Lu, Jianning Cui, Xiao Liu, Shuiqi Cheng, Jiyang Gao, Huazhe Xu, and Hang Zhao. Galaxea open-world dataset and g0 dual-system vla model. arXiv preprint arXiv:2509.00576, 2025

  40. [41]

    AsyncVLA: Asynchronous Flow Matching for Vision-Language-Action Models

    Yuhua Jiang, Shuang Cheng, Yan Ding, Feifei Gao, and Biqing Qi. Asyncvla: Asynchronous flow matching for vision-language-action models. arXiv preprint arXiv:2511.14148, 2025

  41. [42]

    Vima: General robot manipulation with multimodal prompts

    Yunfan Jiang, Agrim Gupta, Zichen Zhang, Guanzhi Wang, Yongqiang Dou, Yanjun Chen, Li Fei-Fei, Anima Anandkumar, Yuke Zhu, and Linxi Fan. Vima: General robot manipulation with multimodal prompts. arXiv preprint arXiv:2210.03094, 2(3):6, 2022

  42. [43]

    DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset

    Alexander Khazatsky, Karl Pertsch, Suraj Nair, Ashwin Balakrishna, Sudeep Dasari, Siddharth Karamcheti, Soroush Nasiriany, Mohan Kumar Srirama, Lawrence Yunliang Chen, Kirsty Ellis, et al. Droid: A large-scale in-the-wild robot manipulation dataset. arXiv preprint arXiv:2403.12945, 2024

  43. [44]

    Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success

    Moo Jin Kim, Chelsea Finn, and Percy Liang. Fine-tuning vision-language-action models: Optimizing speed and success. arXiv preprint arXiv:2502.19645, 2025

  44. [45]

    Openvla: An open-source vision-language-action model

    Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan P Foster, Pannag R Sanketi, Quan Vuong, et al. Openvla: An open-source vision-language-action model. In Conference on Robot Learning, pages 2679–2713. PMLR, 2025

  45. [46]

    Tracegen: World modeling in 3d trace space enables learning from cross-embodiment videos

    Seungjae Lee, Yoonkyo Jung, Inkook Chun, Yao-Chih Lee, Zikui Cai, Hongjia Huang, Aayush Talreja, Tan Dat Dao, Yongyuan Liang, Jia-Bin Huang, et al. Tracegen: World modeling in 3d trace space enables learning from cross-embodiment videos. arXiv preprint arXiv:2511.21690, 2025

  46. [47]

    Spatial forcing: Implicit spatial representation alignment for vision-language-action model

    Fuhao Li, Wenxuan Song, Han Zhao, Jingbo Wang, Pengxiang Ding, Donglin Wang, Long Zeng, and Haoang Li. Spatial forcing: Implicit spatial representation alignment for vision-language-action model. In International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=euMVC1DO4k

  47. [48]

    H2r: A human-to-robot data augmentation for robot pre-training from videos

    Guangrun Li, Yaoxu Lyu, Zhuoyang Liu, Chengkai Hou, Jieyu Zhang, and Shanghang Zhang. H2r: A human-to-robot data augmentation for robot pre-training from videos. arXiv preprint arXiv:2505.11920, 2025

  48. [49]

    Language-guided object-centric diffusion policy for generalizable and collision-aware manipulation

    Hang Li, Qian Feng, Zhi Zheng, Jianxiang Feng, Zhaopeng Chen, and Alois Knoll. Language-guided object-centric diffusion policy for generalizable and collision-aware manipulation. In 2025 IEEE International Conference on Robotics and Automation (ICRA), pages 12834–12841. IEEE, 2025

  49. [50]

    Coa-vla: Improving vision-language-action models via visual-text chain-of-affordance

    Jinming Li, Yichen Zhu, Zhibin Tang, Junjie Wen, Minjie Zhu, Xiaoyu Liu, Chengmeng Li, Ran Cheng, Yaxin Peng, Yan Peng, et al. Coa-vla: Improving vision-language-action models via visual-text chain-of-affordance. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9759–9769, 2025

  50. [51]

    Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models

    Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In International conference on machine learning, pages 19730–19742. PMLR, 2023

  51. [52]

    Bridgevla: Input-output alignment for efficient 3d manipulation learning with vision-language models

    Peiyan Li, Yixiang Chen, Hongtao Wu, Xiao Ma, Xiangnan Wu, Yan Huang, Liang Wang, Tao Kong, and Tieniu Tan. Bridgevla: Input-output alignment for efficient 3d manipulation learning with vision-language models. arXiv preprint arXiv:2506.07961, 2025

  52. [53]

    Posa-vla: Enhancing action generation via pose-conditioned anchor attention

    Ziwen Li, Xin Wang, Hanlue Zhang, Runnan Chen, Runqi Lin, Xiao He, Han Huang, Yandong Guo, Fakhri Karray, Tongliang Liu, et al. Posa-vla: Enhancing action generation via pose-conditioned anchor attention. arXiv preprint arXiv:2512.03724, 2025

  53. [54]

    Skilldiffuser: Interpretable skill planning for latent diffusion-based manipulation

    Yixing Liang, Anna Xie, Ziyun Feng, Yuke Zhu, Song-Chun Zhu, and Yunzhu Li. Skilldiffuser: Interpretable skill planning for latent diffusion-based manipulation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 16467–16476, 2024

  54. [55]

    Discrete Diffusion VLA: Bringing Discrete Diffusion to Action Decoding in Vision-Language-Action Policies

    Zhixuan Liang, Yizhuo Li, Tianshuo Yang, Chengyue Wu, Sitong Mao, Tian Nian, Liuao Pei, Shunbo Zhou, Xiaokang Yang, Jiangmiao Pang, et al. Discrete diffusion vla: Bringing discrete diffusion to action decoding in vision-language-action policies. arXiv preprint arXiv:2508.20072, 2025

  55. [56]

    Manicm: Real-time 3d diffusion policy via consistency model for robotic manipulation

    Fanqi Lin, Haojie Lu, Haojian Fang, and Ping Luo. Manicm: Real-time 3d diffusion policy via consistency model for robotic manipulation. arXiv preprint arXiv:2406.01586, 2024

  56. [57]

    Depth Anything 3: Recovering the Visual Space from Any Views

    Haotong Lin, Sili Chen, Jun Hao Liew, Donny Y. Chen, Zhenyu Li, Guang Shi, Jiashi Feng, and Bingyi Kang. Depth anything 3: Recovering the visual space from any views. arXiv preprint arXiv:2511.10647, 2025

  57. [58]

    Constraint-preserving data generation for one-shot visuomotor policy generalization

    Kevin Lin, Varun Ragunath, Andrew McAlinden, Aaditya Prasad, Jimmy Wu, Yuke Zhu, and Jeannette Bohg. Constraint-preserving data generation for one-shot visuomotor policy generalization. In Proceedings of The 9th Conference on Robot Learning, volume 305 of Proceedings of Machine Learning Research, pages 3631–3646. PMLR, 2025. URL https://proceedings.mlr. ...

  58. [59]

    Libero: Benchmarking knowledge transfer for lifelong robot learning

    Bo Liu, Yifeng Zhu, Chongkai Gao, Yihao Feng, Qiang Liu, Yuke Zhu, and Peter Stone. Libero: Benchmarking knowledge transfer for lifelong robot learning. Advances in Neural Information Processing Systems, 36:44776–44791, 2023

  59. [60]

    Rdt-1b: a diffusion foundation model for bimanual manipulation

    Songming Liu, Lingxuan Wu, Bangguo Li, Hengkai Tan, Huayu Chen, Zhengyi Wang, Ke Xu, Hang Su, and Jun Zhu. Rdt-1b: a diffusion foundation model for bimanual manipulation. In The Thirteenth International Conference on Learning Representations, 2025

  60. [61]

    Hierarchical diffusion policy for kinematics-aware multi-task robotic manipulation

    Jing Ma, Zhengyi Jiang, Rifat Hoque, Sangwoo Ahn, Pulkit Agrawal, and Kaiming Lee. Hierarchical diffusion policy for kinematics-aware multi-task robotic manipulation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 18081–18090, 2024

  61. [62]

    Running vlas at real-time speed

    Yunchao Ma, Yizhuang Zhou, Yunhuan Yang, Tiancai Wang, and Haoqiang Fan. Running vlas at real-time speed. arXiv preprint arXiv:2510.26742, 2025

  62. [63]

    Roboturk: A crowdsourcing platform for robotic skill learning through imitation

    Ajay Mandlekar, Yuke Zhu, Animesh Garg, Jonathan Booher, Max Spero, Albert Tung, Julian Gao, John Emmons, Anchit Gupta, Emre Orbay, et al. Roboturk: A crowdsourcing platform for robotic skill learning through imitation. In Conference on Robot Learning, pages 879–893. PMLR, 2018

  63. [64]

    Calvin: A benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks

    Oier Mees, Lukas Hermann, Erick Rosete-Beas, and Wolfram Burgard. Calvin: A benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks. IEEE Robotics and Automation Letters, 7(3):7327–7334, 2022

  64. [65]

    R3M: A Universal Visual Representation for Robot Manipulation

    Suraj Nair, Aravind Rajeswaran, Vikash Kumar, Chelsea Finn, and Abhinav Gupta. R3m: A universal visual representation for robot manipulation. arXiv preprint arXiv:2203.12601, 2022

  65. [66]

    Vo-dp: Semantic-geometric adaptive diffusion policy for vision-only robotic manipulation

    Zehao Ni, Yonghao He, Lingfeng Qian, Jilei Mao, Fa Fu, Wei Sui, Hu Su, Junran Peng, Zhipeng Wang, and Bin He. Vo-dp: Semantic-geometric adaptive diffusion policy for vision-only robotic manipulation. arXiv preprint arXiv:2510.15530, 2025

  66. [67]

    Open x-embodiment: Robotic learning datasets and rt-x models: Open x-embodiment collaboration 0

    Abby O’Neill, Abdul Rehman, Abhiram Maddukuri, Abhishek Gupta, Abhishek Padalkar, Abraham Lee, Acorn Pooley, Agrim Gupta, Ajay Mandlekar, Ajinkya Jain, et al. Open x-embodiment: Robotic learning datasets and rt-x models: Open x-embodiment collaboration 0. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pages 6892–6903. IEEE, 2024

  67. [68]

    Omnimanip: Towards general robotic manipulation via object-centric interaction primitives as spatial constraints

    Mingjie Pan, Jiyao Zhang, Tianshu Wu, Yinghao Zhao, Wenlong Gao, and Hao Dong. Omnimanip: Towards general robotic manipulation via object-centric interaction primitives as spatial constraints. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 17359–17369, 2025

  68. [69]

    FAST: Efficient Action Tokenization for Vision-Language-Action Models

    Karl Pertsch, Kyle Stachowicz, Brian Ichter, Danny Driess, Suraj Nair, Quan Vuong, Oier Mees, Chelsea Finn, and Sergey Levine. Fast: Efficient action tokenization for vision-language-action models. arXiv preprint arXiv:2501.09747, 2025

  69. [70]

    GeoPredict: Leveraging Predictive Kinematics and 3D Gaussian Geometry for Precise VLA Manipulation

    Jingjing Qian, Boyao Han, Chen Shi, Lei Xiao, Long Yang, Shaoshuai Shi, and Li Jiang. Geopredict: Leveraging predictive kinematics and 3d gaussian geometry for precise vla manipulation. arXiv preprint arXiv:2512.16811, 2025

  70. [71]

    Spatialvla: Exploring spatial representations for visual-language-action model

    Delin Qu, Haoming Song, Qizhi Chen, Yuanqi Yao, Xinyi Ye, Yan Ding, Zhigang Wang, JiaYuan Gu, Bin Zhao, Dong Wang, et al. Spatialvla: Exploring spatial representations for visual-language-action model. In Robotics: Science and Systems, 2025

  71. [72]

    SAM 2: Segment anything in images and videos

    Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, Eric Mintun, Junting Pan, Kalyan Vasudev Alwala, Nicolas Carion, Chao-Yuan Wu, Ross Girshick, Piotr Dollar, and Christoph Feichtenhofer. SAM 2: Segment anything in images and videos. In The Thirteenth Internati...

  72. [73]

    Grounded sam: Assembling open-world models for diverse visual tasks

    Tianhe Ren, Shilong Liu, Ailing Zeng, Jing Lin, Kunchang Li, He Cao, Jiayu Chen, Xinyu Huang, Yukang Chen, Feng Yan, Zhaoyang Zeng, Hao Zhang, Feng Li, Jie Yang, Hongyang Li, Qing Jiang, and Lei Zhang. Grounded sam: Assembling open-world models for diverse visual tasks, 2024

  73. [74]

    CLARE: Continual Learning for Vision-Language-Action Models via Autonomous Adapter Routing and Expansion

    Ralf Römer, Yi Zhang, and Angela P Schoellig. Clare: Continual learning for vision-language-action models via autonomous adapter routing and expansion. arXiv preprint arXiv:2601.09512, 2026

  74. [75]

    Expertise need not monopolize: Action-specialized mixture of experts for vision-language-action learning

    Weijie Shen, Yitian Liu, Yuhao Wu, Zhixuan Liang, Sijia Gu, Dehui Wang, Tian Nian, Lei Xu, Yusen Qin, Jiangmiao Pang, et al. Expertise need not monopolize: Action-specialized mixture of experts for vision-language-action learning. arXiv preprint arXiv:2510.14300, 2025

  75. [76]

    Geovla: Empowering 3d representations in vision-language-action models

    Lin Sun, Bin Xie, Yingfei Liu, Hao Shi, Tiancai Wang, and Jiale Cao. Geovla: Empowering 3d representations in vision-language-action models. arXiv preprint arXiv:2508.09071, 2025

  76. [77]

    Interactive Post-Training for Vision-Language-Action Models

    Shuhan Tan, Kairan Dou, Yue Zhao, and Philipp Krähenbühl. Interactive post-training for vision-language-action models. arXiv preprint arXiv:2505.17016, 2025

  77. [78]

    Gemini Robotics: Bringing AI into the Physical World

    Gemini Robotics Team, Saminda Abeyruwan, Joshua Ainslie, Jean-Baptiste Alayrac, Montserrat Gonzalez Arenas, Travis Armstrong, Ashwin Balakrishna, Robert Baruch, Maria Bauza, Michiel Blokzijl, et al. Gemini robotics: Bringing ai into the physical world. arXiv preprint arXiv:2503.20020, 2025

  78. [79]

    Llama 2: Open Foundation and Fine-Tuned Chat Models

    Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023

  79. [80]

    Attention is all you need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, 2017

  80. [81]

    Bridgedata v2: A dataset for robot learning at scale

    Homer Rich Walke, Kevin Black, Tony Z Zhao, Quan Vuong, Chongyi Zheng, Philippe Hansen-Estruch, Andre Wang He, Vivek Myers, Moo Jin Kim, Max Du, et al. Bridgedata v2: A dataset for robot learning at scale. In Conference on Robot Learning, pages 1723–
