pith. machine review for the scientific record.

arxiv: 2501.15830 · v5 · submitted 2025-01-27 · 💻 cs.RO · cs.AI

Recognition: 2 theorem links · Lean Theorem

SpatialVLA: Exploring Spatial Representations for Visual-Language-Action Model

Bin Zhao, Delin Qu, Dong Wang, Haoming Song, Jiayuan Gu, Qizhi Chen, Xinyi Ye, Xuelong Li, Yan Ding, Yuanqi Yao, Zhigang Wang

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 06:06 UTC · model grok-4.3

classification 💻 cs.RO cs.AI
keywords SpatialVLA · visual-language-action · spatial representation · robot manipulation · 3D position encoding · action grids · generalization · zero-shot

The pith

SpatialVLA uses 3D position encoding and adaptive action grids to build generalist robot manipulation policies with strong generalization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that spatial understanding is the key to effective robot manipulation by developing SpatialVLA, a visual-language-action model enhanced with specific spatial components. Ego3D Position Encoding adds 3D information to visual observations, while Adaptive Action Grids discretize actions adaptively to learn transferable spatial knowledge. Pre-trained on 1.1 million real-world episodes, the model performs multiple tasks zero-shot and demonstrates advantages in complex trajectory inference and multi-task generalization, both in simulation and on real robots. It also supports efficient fine-tuning for new setups through re-discretization of the action grids. Sympathetic readers should care because this points to a path toward more adaptable robot foundation models that require less customization per environment.

Core claim

By introducing Ego3D Position Encoding to inject 3D information into the input observations and proposing Adaptive Action Grids to represent spatial robot movement actions with adaptive discretized action grids, SpatialVLA facilitates learning generalizable and transferable spatial action knowledge for cross-robot control. Pre-trained on top of a vision-language model with 1.1 million real-world robot episodes, it learns a generalist manipulation policy that is directly applied in a zero-shot manner, with superior results showing advantages in inferring complex robot motion trajectories and strong in-domain multi-task generalization ability. The Adaptive Action Grids further offer a new and effective way to fine-tune the pre-trained model for new simulation and real-world setups, where the pre-learned action grids are re-discretized to capture robot-specific spatial action movements.
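
To make the first component concrete, here is a minimal sketch of what an Ego3D-style position encoding could look like: back-project each visual patch to egocentric 3D coordinates from depth and camera intrinsics, encode those coordinates, and add them to the patch tokens. This is one plausible reading under a pinhole-camera assumption, not the authors' released implementation; every name below is illustrative.

```python
# Illustrative sketch only: a plausible Ego3D-style position encoding under a
# pinhole camera assumption. Names and shapes are hypothetical.
import torch
import torch.nn as nn

def backproject(depth: torch.Tensor, K: torch.Tensor) -> torch.Tensor:
    """Lift an (H, W) depth map to egocentric 3D points of shape (H, W, 3)."""
    H, W = depth.shape
    v, u = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    x = (u - K[0, 2]) / K[0, 0] * depth  # pinhole model: X = (u - cx) / fx * Z
    y = (v - K[1, 2]) / K[1, 1] * depth  # pinhole model: Y = (v - cy) / fy * Z
    return torch.stack([x, y, depth], dim=-1)

class Ego3DPositionEncoding(nn.Module):
    """Encode per-patch 3D positions and add them to the visual tokens."""
    def __init__(self, dim: int):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(3, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, tokens: torch.Tensor, patch_xyz: torch.Tensor) -> torch.Tensor:
        # tokens: (N, dim) patch features; patch_xyz: (N, 3) patch centers in 3D
        return tokens + self.mlp(patch_xyz)
```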

What carries the argument

The Ego3D Position Encoding and Adaptive Action Grids, which together give the visual-language-action model spatial awareness: the former adds 3D positional data to the input observations, while the latter represents actions on adaptive discretized grids to support cross-robot transfer.
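
The second component can be sketched in the same hedged spirit: "adaptive" discretization plausibly means data-driven bin edges, for example quantiles of the pre-training action distribution, so that frequent motions get finer resolution than a fixed uniform grid would give them. The functions below are illustrative, not the paper's verified algorithm.

```python
# Hedged sketch of adaptive action discretization via per-dimension quantiles.
import numpy as np

def fit_adaptive_grid(actions: np.ndarray, n_bins: int) -> np.ndarray:
    """Per-dimension bin edges of shape (D, n_bins + 1) from (N, D) actions."""
    qs = np.linspace(0.0, 1.0, n_bins + 1)
    return np.stack([np.quantile(actions[:, d], qs) for d in range(actions.shape[1])])

def discretize(action: np.ndarray, edges: np.ndarray) -> np.ndarray:
    """Map a (D,) continuous action to per-dimension token ids in [0, n_bins)."""
    ids = [np.searchsorted(edges[d], action[d], side="right") - 1 for d in range(action.shape[0])]
    return np.clip(np.array(ids), 0, edges.shape[1] - 2)

# Fine-tuning on a new robot would re-fit the edges on that robot's action
# statistics (the "re-discretization" the paper describes) while reusing the
# learned action-token embeddings.
```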

If this is right

  • Direct zero-shot application to numerous tasks after pre-training on 1.1M episodes.
  • Advantage in inferring complex robot motion trajectories in simulation and real-world.
  • Strong in-domain multi-task generalization across multiple robot environments.
  • Effective fine-tuning for new simulation and real-world setups via re-discretized action grids.
  • Exceptional in-distribution generalization and out-of-distribution adaptation capability.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar spatial injection techniques could be applied to other foundation models in robotics to improve their spatial reasoning without full retraining.
  • The adaptive discretization might allow for easier integration of new robot hardware by preserving learned spatial priors.
  • Extending this to longer-horizon tasks or environments with dynamic obstacles could test the limits of the spatial representations.
  • Combining the model with online adaptation mechanisms might further enhance real-world deployment reliability.

Load-bearing premise

That the reported performance gains stem mainly from the Ego3D Position Encoding and Adaptive Action Grids rather than from the choice of vision-language model base or the volume of pre-training data alone.

What would settle it

Training an identical model without the Ego3D encoding or with non-adaptive fixed action grids on the same 1.1M episodes and evaluating whether the generalization metrics in simulation and real-world tasks match or fall short of the SpatialVLA results.
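
In configuration terms, the decisive experiment is a small factorial ablation. The sketch below is hypothetical; the variant names and config keys are placeholders, not the authors' setup.

```python
# Hypothetical 2x2 ablation grid: data, backbone, and training recipe held
# fixed; only the two spatial components vary.
ABLATIONS = {
    "full":     {"position_encoding": "ego3d",    "action_grid": "adaptive"},
    "no_ego3d": {"position_encoding": "standard", "action_grid": "adaptive"},
    "no_adapt": {"position_encoding": "ego3d",    "action_grid": "uniform"},
    "neither":  {"position_encoding": "standard", "action_grid": "uniform"},
}
# If only "full" matches the reported SpatialVLA numbers on the same 1.1M
# episodes, the gains attribute to the spatial components; if all rows tie,
# pre-training scale alone may explain them.
```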

read the original abstract

In this paper, we claim that spatial understanding is the keypoint in robot manipulation, and propose SpatialVLA to explore effective spatial representations for the robot foundation model. Specifically, we introduce Ego3D Position Encoding to inject 3D information into the input observations of the visual-language-action model, and propose Adaptive Action Grids to represent spatial robot movement actions with adaptive discretized action grids, facilitating learning generalizable and transferrable spatial action knowledge for cross-robot control. SpatialVLA is first pre-trained on top of a vision-language model with 1.1 Million real-world robot episodes, to learn a generalist manipulation policy across multiple robot environments and tasks. After pre-training, SpatialVLA is directly applied to perform numerous tasks in a zero-shot manner. The superior results in both simulation and real-world robots demonstrate its advantage of inferring complex robot motion trajectories and its strong in-domain multi-task generalization ability. We further show the proposed Adaptive Action Grids offer a new and effective way to fine-tune the pre-trained SpatialVLA model for new simulation and real-world setups, where the pre-learned action grids are re-discretized to capture robot-specific spatial action movements of new setups. The superior results from extensive evaluations demonstrate the exceptional in-distribution generalization and out-of-distribution adaptation capability, highlighting the crucial benefit of the proposed spatial-aware representations for generalist robot policy learning. All the details and codes will be open-sourced.

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated author's rebuttal, circularity audit, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance; this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces SpatialVLA, a visual-language-action model for robot manipulation that augments a VLM backbone with Ego3D Position Encoding (to inject 3D spatial information into visual observations) and Adaptive Action Grids (to discretize actions adaptively for cross-robot transfer). The model is pre-trained on 1.1 million real-world robot episodes, then evaluated zero-shot on simulation and real-world tasks and further fine-tuned via re-discretization of the action grids for new setups. The central claim is that these spatial representations enable superior trajectory inference, strong in-domain multi-task generalization, and effective out-of-distribution adaptation compared to prior VLA approaches.

Significance. If the performance claims are supported by rigorous quantitative evidence and isolating ablations, the work would meaningfully advance generalist robot policies by demonstrating that explicit spatial encodings and adaptive action discretization can improve generalization across robots and tasks beyond scale alone. The large-scale pre-training regime and commitment to open-sourcing code and models are positive contributions that could facilitate follow-on research.

major comments (2)
  1. [§4 (Experiments)] In §4 and the associated tables and figures, the central attribution of superior zero-shot and fine-tuning results to Ego3D Position Encoding and Adaptive Action Grids is not yet load-bearing, because the manuscript reports only end-to-end comparisons against external baselines. No controlled ablations are described that hold the 1.1M-episode pre-training data, VLM backbone, and training procedure fixed while swapping in standard positional encodings or fixed (non-adaptive) action grids. Without these, it remains possible that the gains derive primarily from pre-training scale rather than from the proposed spatial components.
  2. [Abstract and §4] The repeated claims of 'superior results' and 'strong in-domain multi-task generalization' are presented without early quantitative anchors (specific success rates, baselines, error bars, or statistical significance). This makes the strength of the empirical support difficult to assess from the high-level summary and forces the reader to locate the precise metrics and comparisons later in the text.
minor comments (2)
  1. [§3] Notation for Ego3D Position Encoding and the discretization parameters of Adaptive Action Grids should be introduced with explicit equations or pseudocode in §3 to allow precise reproduction; a hypothetical example of such a definition follows these comments.
  2. [Conclusion] The manuscript states that 'all details and codes will be open-sourced' but does not specify the exact release timeline or repository; adding this information would strengthen reproducibility claims.
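
For illustration, the kind of explicit definition minor comment 1 asks for might take the following form (hypothetical notation, not the paper's equations):

```latex
% Hypothetical notation only; not the paper's equations.
\begin{align}
  % Ego3D encoding: patch feature f_i shifted by an MLP of its 3D center p_i,
  % back-projected from pixel (u_i, v_i) with depth D_i and intrinsics K.
  \tilde{f}_i &= f_i + \mathrm{MLP}(p_i), &
  p_i &= D_i \, K^{-1} (u_i, v_i, 1)^{\top} \\
  % Adaptive grid: bin edges at empirical quantiles \hat{F}_d^{-1} of action
  % dimension d, for B bins per dimension.
  e_{d,k} &= \hat{F}_d^{-1}\!\bigl(k / B\bigr), & k &= 0, \dots, B
\end{align}
```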

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments and the opportunity to clarify and strengthen our manuscript. We address each major comment below.

read point-by-point responses
  1. Referee: [§4 (Experiments)] In §4 and the associated tables and figures, the central attribution of superior zero-shot and fine-tuning results to Ego3D Position Encoding and Adaptive Action Grids is not yet load-bearing, because the manuscript reports only end-to-end comparisons against external baselines. No controlled ablations are described that hold the 1.1M-episode pre-training data, VLM backbone, and training procedure fixed while swapping in standard positional encodings or fixed (non-adaptive) action grids. Without these, it remains possible that the gains derive primarily from pre-training scale rather than from the proposed spatial components.

    Authors: We agree that controlled ablations would provide stronger evidence for the specific contributions of our proposed components. In the revised manuscript, we will include additional experiments that fix the 1.1M-episode pre-training data, VLM backbone, and training procedure, and compare variants with standard positional encodings versus Ego3D Position Encoding, as well as fixed action grids versus Adaptive Action Grids. These ablations will help isolate the impact of the spatial representations. revision: yes

  2. Referee: [Abstract and §4] The repeated claims of 'superior results' and 'strong in-domain multi-task generalization' are presented without early quantitative anchors (specific success rates, baselines, error bars, or statistical significance). This makes the strength of the empirical support difficult to assess from the high-level summary and forces the reader to locate the precise metrics and comparisons later in the text.

    Authors: We acknowledge that early quantitative anchors would enhance the clarity of our claims. We will revise the abstract and the opening of §4 to include specific success rates from our evaluations, comparisons to key baselines, and references to error bars and statistical details provided in the tables and figures. This will allow readers to immediately gauge the empirical support without needing to search further in the paper. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical pre-training and evaluation with no self-referential derivations

full rationale

The paper proposes two spatial components (Ego3D Position Encoding and Adaptive Action Grids), pre-trains a VLA model on 1.1M real-world episodes, then reports zero-shot and fine-tuning results on simulation and real robots. No equations, predictions, or first-principles derivations are presented that reduce to fitted parameters or prior self-citations by construction. All claims are framed as measured experimental outcomes rather than analytic necessities. No load-bearing self-citations, uniqueness theorems, or ansatzes appear in the abstract or described method. This is a standard empirical robotics paper whose central claims rest on external benchmarks and data, not internal tautologies.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 2 invented entities

The central claims rest on the assumption that adding explicit 3D encodings and adaptive discretization will improve spatial reasoning in transformer-based VLA models trained at scale, plus standard assumptions about large-scale pre-training leading to generalization.

free parameters (1)
  • Action grid discretization resolution
    Adaptive grids require choices of bin sizes or resolution that are likely tuned per robot or task; see the resolution sketch after this ledger.
axioms (2)
  • domain assumption Vision-language models can be extended with additional position encodings to incorporate 3D spatial information effectively
    Invoked when introducing Ego3D Position Encoding as a direct injection into input observations.
  • domain assumption Discretized action grids can capture transferable spatial movement knowledge across robots
    Central to the claim that re-discretization enables fine-tuning for new setups.
invented entities (2)
  • Ego3D Position Encoding no independent evidence
    purpose: Inject 3D information into the input observations of the visual-language-action model
    New encoding scheme proposed to address spatial understanding limitations.
  • Adaptive Action Grids no independent evidence
    purpose: Represent spatial robot movement actions with adaptive discretized action grids
    New representation for actions to facilitate generalizable and transferable spatial knowledge.
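
As a hedged illustration of the one free parameter flagged above, the snippet below measures how quantile-grid resolution trades quantization error against action-vocabulary size on synthetic Gaussian actions; the data and numbers are illustrative, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(0)
actions = rng.normal(size=(10_000,))  # stand-in for one action dimension

for n_bins in (16, 64, 256):
    edges = np.quantile(actions, np.linspace(0.0, 1.0, n_bins + 1))
    centers = 0.5 * (edges[:-1] + edges[1:])            # bin representatives
    ids = np.clip(np.searchsorted(edges, actions, side="right") - 1, 0, n_bins - 1)
    err = np.abs(actions - centers[ids]).mean()         # mean quantization error
    print(f"n_bins={n_bins:4d}  mean |error| = {err:.4f}")
```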

pith-pipeline@v0.9.0 · 5590 in / 1643 out tokens · 110445 ms · 2026-05-12T06:06:01.441357+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • Foundation.DimensionForcing D3_admits_circle_linking echoes

    we introduce Ego3D Position Encoding to inject 3D information into the input observations of the visual-language-action model, and propose Adaptive Action Grids to represent spatial robot movement actions with adaptive discretized action grids

Forward citations

Cited by 42 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. ALAM: Algebraically Consistent Latent Action Model for Vision-Language-Action Models

    cs.RO 2026-05 unverdicted novelty 7.0

    ALAM creates algebraically consistent latent action transitions from videos to act as auxiliary generative targets, raising robot policy success rates from 47.9% to 85.0% on MetaWorld MT50 and 94.1% to 98.1% on LIBERO.

  2. VEGA: Visual Encoder Grounding Alignment for Spatially-Aware Vision-Language-Action Models

    cs.RO 2026-05 unverdicted novelty 7.0

    VEGA improves spatial reasoning in VLA models for robotics by aligning visual encoder features with 3D-supervised DINOv2 representations via a temporary projector and cosine similarity loss.

  3. Towards Backdoor-Based Ownership Verification for Vision-Language-Action Models

    cs.RO 2026-05 unverdicted novelty 7.0

    GuardVLA embeds a stealthy backdoor watermark in VLAs via secret messages in visual data and uses a swap-and-detect mechanism for post-release ownership verification that preserves task performance.

  4. OA-WAM: Object-Addressable World Action Model for Robust Robot Manipulation

    cs.RO 2026-05 unverdicted novelty 7.0

    OA-WAM uses persistent address vectors and dynamic content vectors in object slots to enable addressable world-action prediction, improving robustness on manipulation benchmarks under scene changes.

  5. EA-WM: Event-Aware Generative World Model with Structured Kinematic-to-Visual Action Fields

    cs.CV 2026-05 unverdicted novelty 7.0

    EA-WM generates more accurate robot world rollouts by projecting actions as structured visual fields in camera space and using event-aware bidirectional fusion to better capture interaction dynamics.

  6. MolmoAct2: Action Reasoning Models for Real-world Deployment

    cs.RO 2026-05 unverdicted novelty 7.0

    MolmoAct2 delivers an open VLA model with new specialized components, datasets, and techniques that outperforms baselines on benchmarks while releasing all weights, code, and data for real-world robot use.

  7. Thinking in Text and Images: Interleaved Vision--Language Reasoning Traces for Long-Horizon Robot Manipulation

    cs.AI 2026-05 unverdicted novelty 7.0

    A multimodal transformer generates and caches interleaved text-image traces to guide closed-loop actions, achieving 92.4% success on LIBERO-Long and 95.5% average on LIBERO.

  8. Being-H0.7: A Latent World-Action Model from Egocentric Videos

    cs.RO 2026-04 unverdicted novelty 7.0

    Being-H0.7 adds future-aware latent reasoning to direct VLA policies via dual-branch alignment on latent queries, matching world-model benefits at VLA efficiency.

  9. VistaBot: View-Robust Robot Manipulation via Spatiotemporal-Aware View Synthesis

    cs.RO 2026-04 unverdicted novelty 7.0

    VistaBot integrates 4D geometry estimation and spatiotemporal view synthesis into action policies to improve cross-view generalization by 2.6-2.8x on a new VGS metric in simulation and real tasks.

  10. Why MLLMs Struggle to Determine Object Orientations

    cs.CV 2026-04 accept novelty 7.0

    Orientation information is recoverable from MLLM visual encoder embeddings via linear regression, contradicting the hypothesis that failures originate in the encoders.

  11. DFM-VLA: Iterative Action Refinement for Robot Manipulation via Discrete Flow Matching

    cs.RO 2026-03 conditional novelty 7.0

    DFM-VLA uses discrete flow matching to iteratively refine action tokens in VLA models, outperforming autoregressive and diffusion baselines with 4.44 average success length on CALVIN and 95.7% success on LIBERO.

  12. VP-VLA: Visual Prompting as an Interface for Vision-Language-Action Models

    cs.RO 2026-03 unverdicted novelty 7.0

    VP-VLA decouples high-level reasoning from low-level control in VLA models by rendering spatial anchors as visual prompts directly in the RGB observation space, outperforming end-to-end baselines.

  13. FrameSkip: Learning from Fewer but More Informative Frames in VLA Training

    cs.RO 2026-05 unverdicted novelty 6.0

    FrameSkip improves VLA policy training success from 66.50% to 76.15% by selecting high-importance frames and retaining only 20% of unique frames across three benchmarks.

  14. See What Matters: Differentiable Grid Sample Pruning for Generalizable Vision-Language-Action Model

    cs.RO 2026-05 unverdicted novelty 6.0

    GridS reduces visual tokens in VLA models to under 10% of the original count via task-aware differentiable resampling, delivering 76% lower FLOPs with no drop in task success rate on benchmarks and real robots.

  15. ALAM: Algebraically Consistent Latent Action Model for Vision-Language-Action Models

    cs.RO 2026-05 unverdicted novelty 6.0

    ALAM introduces algebraic consistency regularization on latent action transitions from videos, raising VLA success rates from 47.9% to 85.0% on MetaWorld MT50 and 94.1% to 98.1% on LIBERO.

  16. VISOR: A Vision-Language Model-based Test Oracle for Testing Robot

    cs.SE 2026-05 unverdicted novelty 6.0

    VISOR applies VLMs to automate robot test oracles for correctness and quality assessment while reporting uncertainty, with evaluation on GPT and Gemini showing trade-offs in precision and recall but poor uncertainty c...

  17. One Token Per Frame: Reconsidering Visual Bandwidth in World Models for VLA Policy

    cs.CV 2026-05 unverdicted novelty 6.0

    Reducing visual input to one token per frame in world models for vision-language-action policies maintains long-horizon performance while improving success rates on MetaWorld, LIBERO, and real-robot tasks.

  18. One Token Per Frame: Reconsidering Visual Bandwidth in World Models for VLA Policy

    cs.CV 2026-05 unverdicted novelty 6.0

    Reducing visual input to one token per frame via adaptive attention pooling and a unified flow-matching objective improves long-horizon performance in VLA policies on MetaWorld, LIBERO, and real-robot tasks.

  19. ForgeVLA: Federated Vision-Language-Action Learning without Language Annotations

    cs.CV 2026-05 unverdicted novelty 6.0

    ForgeVLA enables federated VLA model training from unlabeled vision-action pairs by recovering language via embodied classifiers and using contrastive planning plus adaptive aggregation to avoid feature collapse.

  20. TriRelVLA: Triadic Relational Structure for Generalizable Embodied Manipulation

    cs.CV 2026-05 unverdicted novelty 6.0

    TriRelVLA introduces triadic object-hand-task relational representations and a task-grounded graph transformer with a relational bottleneck to improve generalization in robotic manipulation across scenes, objects, and tasks.

  21. ConsisVLA-4D: Advancing Spatiotemporal Consistency in Efficient 3D-Perception and 4D-Reasoning for Robotic Manipulation

    cs.RO 2026-05 unverdicted novelty 6.0

    ConsisVLA-4D adds cross-view semantic alignment, cross-object geometric fusion, and cross-scene dynamic reasoning to VLA models, delivering 21.6% and 41.5% gains plus 2.3x and 2.4x speedups on LIBERO and real-world tasks.

  22. MolmoAct2: Action Reasoning Models for Real-world Deployment

    cs.RO 2026-05 unverdicted novelty 6.0

    MolmoAct2 is an open VLA model that outperforms baselines like Pi-05 on 7 benchmarks and whose backbone surpasses GPT-5 on 13 embodied-reasoning tasks through new datasets, specialized training, and architecture chang...

  23. LaST-R1: Reinforcing Robotic Manipulation via Adaptive Physical Latent Reasoning

    cs.RO 2026-04 unverdicted novelty 6.0

    LaST-R1 introduces a RL post-training method called LAPO that optimizes latent Chain-of-Thought reasoning in vision-language-action models, yielding 99.9% success on LIBERO and up to 22.5% real-world gains.

  24. LaST-R1: Reinforcing Robotic Manipulation via Adaptive Physical Latent Reasoning

    cs.RO 2026-04 unverdicted novelty 6.0

    LaST-R1 reaches 99.8% average success on the LIBERO benchmark using one-shot warm-up plus LAPO reinforcement learning on latent physical reasoning, with up to 44% real-world gains on complex single- and dual-arm tasks.

  25. PRTS: A Primitive Reasoning and Tasking System via Contrastive Representations

    cs.AI 2026-04 unverdicted novelty 6.0

    PRTS pretrains VLA models with contrastive goal-conditioned RL to embed goal-reachability probabilities from offline data, yielding SOTA results on robotic benchmarks especially for long-horizon and novel instructions.

  26. Unified 4D World Action Modeling from Video Priors with Asynchronous Denoising

    cs.RO 2026-04 unverdicted novelty 6.0

    X-WAM unifies robotic action execution and 4D world synthesis by adapting video diffusion priors with a lightweight depth branch and asynchronous noise sampling, achieving 79-91% success on robot benchmarks.

  27. Unified 4D World Action Modeling from Video Priors with Asynchronous Denoising

    cs.RO 2026-04 unverdicted novelty 6.0

    X-WAM unifies real-time robotic action execution with high-fidelity 4D world synthesis by adapting video diffusion priors through lightweight depth branches and asynchronous noise sampling, achieving 79-91% success on...

  28. Learning Human-Intention Priors from Large-Scale Human Demonstrations for Robotic Manipulation

    cs.RO 2026-04 unverdicted novelty 6.0

    MoT-HRA learns embodiment-agnostic human-intention priors from the HA-2.2M dataset of 2.2M human video episodes through a three-expert hierarchy to improve robotic motion plausibility and robustness under distribution shift.

  29. Test-Time Perturbation Learning with Delayed Feedback for Vision-Language-Action Models

    cs.CV 2026-04 unverdicted novelty 6.0

    PDF improves VLA success rates on LIBERO and Atari by applying test-time perturbation learning with delayed feedback to correct trajectory overfitting and overconfidence.

  30. Unmasking the Illusion of Embodied Reasoning in Vision-Language-Action Models

    cs.RO 2026-04 unverdicted novelty 6.0

    State-of-the-art vision-language-action models catastrophically fail dynamic embodied reasoning due to lexical-kinematic shortcuts, behavioral inertia, and semantic feature collapse caused by architectural bottlenecks...

  31. ST-π: Structured SpatioTemporal VLA for Robotic Manipulation

    cs.RO 2026-04 unverdicted novelty 6.0

    ST-π structures VLA models by having a spatiotemporal VLM produce causally ordered chunk-level prompts that guide a dual-generator action expert to jointly handle spatial and temporal control in robotic manipulation.

  32. Robotic Manipulation is Vision-to-Geometry Mapping (f(v) → G): Vision-Geometry Backbones over Language and Video Models

    cs.RO 2026-04 unverdicted novelty 6.0

    Vision-geometry backbones using pretrained 3D world models outperform vision-language and video models for robotic manipulation by enabling direct mapping from visual input to geometric actions.

  33. Adaptive Action Chunking at Inference-time for Vision-Language-Action Models

    cs.RO 2026-04 unverdicted novelty 6.0

    Adaptive Action Chunking uses action entropy to dynamically adjust chunk sizes in VLA models, improving performance on simulated and real robotic manipulation tasks.

  34. Multi-View Video Diffusion Policy: A 3D Spatio-Temporal-Aware Video Action Model

    cs.RO 2026-04 conditional novelty 6.0

    MV-VDP jointly predicts multi-view RGB and heatmap videos via diffusion to achieve data-efficient, robust robotic manipulation policies.

  35. X-Imitator: Spatial-Aware Imitation Learning via Bidirectional Action-Pose Interaction

    cs.RO 2026-05 unverdicted novelty 5.0

    X-Imitator is a bidirectional action-pose interaction framework for spatial-aware imitation learning that outperforms vanilla policies and explicit pose guidance on 24 simulated and 3 real-world robotic tasks.

  36. Gated Memory Policy

    cs.RO 2026-04 unverdicted novelty 5.0

    GMP selectively activates and represents memory via a gate and lightweight cross-attention, yielding 30.1% higher success on non-Markovian robotic tasks while staying competitive on Markovian ones.

  37. R3D: Revisiting 3D Policy Learning

    cs.CV 2026-04 unverdicted novelty 5.0

    A transformer 3D encoder plus diffusion decoder architecture, with 3D-specific augmentations, outperforms prior 3D policy methods on manipulation benchmarks by improving training stability.

  38. World-Value-Action Model: Implicit Planning for Vision-Language-Action Systems

    cs.RO 2026-04 unverdicted novelty 5.0

    The World-Value-Action model enables implicit planning for VLA systems by performing inference over a learned latent representation of high-value future trajectories instead of direct action prediction.

  39. From Video to Control: A Survey of Learning Manipulation Interfaces from Temporal Visual Data

    cs.RO 2026-04 accept novelty 5.0

    A survey introduces an interface-centric taxonomy for video-to-control methods in robotic manipulation and identifies the robotics integration layer as the central open challenge.

  40. Fast-dVLA: Accelerating Discrete Diffusion VLA to Real-Time Performance

    cs.RO 2026-03 unverdicted novelty 5.0

    Parameter differences from two training runs on a small task set are treated as auxiliary capability vectors that are merged into a pretrained VLA model, yielding auxiliary-task gains at the cost of ordinary supervise...

  41. MoGe-2: Accurate Monocular Geometry with Metric Scale and Sharp Details

    cs.CV 2025-07 unverdicted novelty 5.0

    MoGe-2 recovers metric-scale 3D point maps with fine details from single images via data refinement and extension of affine-invariant predictions.

  42. JoyAI-RA 0.1: A Foundation Model for Robotic Autonomy

    cs.RO 2026-04 unverdicted novelty 4.0

    JoyAI-RA is a multi-source pretrained VLA model that claims to bridge human-to-robot embodiment gaps via data unification and outperforms prior methods on generalization-heavy robotic tasks.

Reference graph

Works this paper leans on

83 extracted references · 83 canonical work pages · cited by 37 Pith papers · 11 internal anchors

  1. [1]

    Flamingo: a visual language model for few-shot learning

    Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. In Proceedings of the Conference on Neural Information Processing Systems (NeurIPS), 2022

  2. [2]

    Hydra: Hybrid robot actions for imitation learning

    Suneel Belkhale, Yuchen Cui, and Dorsa Sadigh. Hydra: Hybrid robot actions for imitation learning. In Proceedings of the Conference on Robot Learning (CoRL), 2023

  3. [3]

    Roboagent: Generalization and efficiency in robot manipulation via semantic augmentations and action chunking

    Homanga Bharadhwaj, Jay Vakil, Mohit Sharma, Abhinav Gupta, Shubham Tulsiani, and Vikash Kumar. Roboagent: Generalization and efficiency in robot manipulation via semantic augmentations and action chunking. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), 2024

  4. [4]

    ZoeDepth: Zero-shot Transfer by Combining Relative and Metric Depth

    Shariq Farooq Bhat, Reiner Birkl, Diana Wofk, Peter Wonka, and Matthias Müller. Zoedepth: Zero-shot transfer by combining relative and metric depth. arXiv preprint arXiv:2302.12288, 2023

  5. [5]

    π0: A Vision-Language-Action Flow Model for General Robot Control

    Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al. π0: A vision-language-action flow model for general robot control. arXiv preprint arXiv:2410.24164, 2024

  6. [6]

    RT-1: Robotics Transformer for Real-World Control at Scale

    Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Joseph Dabis, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, Jasmine Hsu, et al. Rt-1: Robotics transformer for real-world control at scale. arXiv preprint arXiv:2212.06817, 2022

  7. [7]

    RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control

    Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Xi Chen, Krzysztof Choromanski, Tianli Ding, Danny Driess, Avinava Dubey, Chelsea Finn, et al. Rt-2: Vision-language-action models transfer web knowledge to robotic control. arXiv preprint arXiv:2307.15818, 2023

  8. [8]

    GR-2: A Generative Video-Language-Action Model with Web-Scale Knowledge for Robot Manipulation

    Chi-Lam Cheang, Guangzeng Chen, Ya Jing, Tao Kong, Hang Li, Yifeng Li, Yuxiao Liu, Hongtao Wu, Jiafeng Xu, Yichu Yang, et al. Gr-2: A generative video-language-action model with web-scale knowledge for robot manipulation. arXiv preprint arXiv:2410.06158, 2024

  9. [9]

    Berkeley UR5 demonstration dataset

    Lawrence Yunliang Chen, Simeon Adebola, and Ken Goldberg. Berkeley UR5 demonstration dataset. https://sites.google.com/view/berkeley-ur5/home

  10. [10]

    Ll3da: Visual interactive instruction tuning for omni-3d understanding reasoning and planning

    Sijin Chen, Xin Chen, Chi Zhang, Mingsheng Li, Gang Yu, Hao Fei, Hongyuan Zhu, Jiayuan Fan, and Tao Chen. Ll3da: Visual interactive instruction tuning for omni-3d understanding reasoning and planning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024

  11. [11]

    Pali-x: On scaling up a multilingual vision and language model

    Xi Chen, Josip Djolonga, Piotr Padlewski, Basil Mustafa, Soravit Changpinyo, Jialin Wu, Carlos Riquelme Ruiz, Sebastian Goodman, Xiao Wang, Yi Tay, et al. Pali-x: On scaling up a multilingual vision and language model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024

  12. [12]

    Diffusion policy: Visuomotor policy learning via action diffusion

    Cheng Chi, Zhenjia Xu, Siyuan Feng, Eric Cousineau, Yilun Du, Benjamin Burchfiel, Russ Tedrake, and Shuran Song. Diffusion policy: Visuomotor policy learning via action diffusion. In Proceedings of Robotics: Science and Systems (RSS), 2023

  13. [13]

    Open x-embodiment: Robotic learning datasets and rt-x models

    Open X-Embodiment Collaboration, Abby O’Neill, Abdul Rehman, Abhiram Maddukuri, Abhishek Gupta, Abhishek Padalkar, Abraham Lee, Acorn Pooley, Agrim Gupta, Ajay Mandlekar, Ajinkya Jain, et al. Open x-embodiment: Robotic learning datasets and rt-x models. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), 2024

  14. [14]

    From play to policy: Conditional behavior generation from uncurated robot data

    Zichen Jeff Cui, Yibin Wang, Nur Muhammad Mahi Shafiullah, and Lerrel Pinto. From play to policy: Conditional behavior generation from uncurated robot data. In Proceedings of International Conference on Learning Representations (ICLR), 2023

  15. [15]

    Clvr jaco play dataset

    Shivin Dass, Jullian Yapeter, Jesse Zhang, Jiahui Zhang, Karl Pertsch, Stefanos Nikolaidis, and Joseph J. Lim. Clvr jaco play dataset, 2023. URL https://github.com/clvrai/clvr_jaco_play_dataset

  16. [16]

    Scaling cross-embodied learning: One policy for manipulation, navigation, locomotion and aviation

    Ria Doshi, Homer Walke, Oier Mees, Sudeep Dasari, and Sergey Levine. Scaling cross-embodied learning: One policy for manipulation, navigation, locomotion and aviation. In Proceedings of the Conference on Robot Learning (CoRL), 2024

  17. [17]

    Bridge data: Boosting generalization of robotic skills with cross-domain datasets

    Frederik Ebert, Yanlai Yang, Karl Schmeckpeper, Bernadette Bucher, Georgios Georgakis, Kostas Daniilidis, Chelsea Finn, and Sergey Levine. Bridge data: Boosting generalization of robotic skills with cross-domain datasets. In Proceedings of Robotics: Science and Systems (RSS), 2022

  18. [18]

    Rh20t: A comprehensive robotic dataset for learning diverse skills in one-shot

    Hao-Shu Fang, Hongjie Fang, Zhenyu Tang, Jirong Liu, Junbo Wang, Haoyi Zhu, and Cewu Lu. Rh20t: A comprehensive robotic dataset for learning diverse skills in one-shot. In RSS 2023 Workshop on Learning for Task and Motion Planning, 2023

  19. [19]

    Scene-llm: Extending language model for 3d visual understanding and reasoning

    Rao Fu, Jingyu Liu, Xilun Chen, Yixin Nie, and Wenhan Xiong. Scene-llm: Extending language model for 3d visual understanding and reasoning. arXiv preprint arXiv:2403.11401, 2024

  20. [20]

    The organization of learning

    Charles R Gallistel. The organization of learning. The MIT Press, 1990

  21. [21]

    Polytask: Learning unified policies through behavior distillation

    Siddhant Haldar and Lerrel Pinto. Polytask: Learning unified policies through behavior distillation. arXiv preprint arXiv:2310.08573, 2023

  22. [22]

    Baku: An efficient transformer for multi-task policy learning

    Siddhant Haldar, Zhuoran Peng, and Lerrel Pinto. Baku: An efficient transformer for multi-task policy learning. In Proceedings of the Conference on Neural Information Processing Systems (NeurIPS), 2024

  23. [23]

    Furniturebench: Reproducible real-world benchmark for long-horizon complex manipulation

    Minho Heo, Youngwoon Lee, Doohyun Lee, and Joseph J Lim. Furniturebench: Reproducible real-world benchmark for long-horizon complex manipulation. In Proceedings of Robotics: Science and Systems (RSS), 2023

  24. [24]

    3d-llm: Injecting the 3d world into large language models

    Yining Hong, Haoyu Zhen, Peihao Chen, Shuhong Zheng, Yilun Du, Zhenfang Chen, and Chuang Gan. 3d-llm: Injecting the 3d world into large language models. In Proceedings of the Conference on Neural Information Processing Systems (NeurIPS), 2023

  25. [25]

    An embodied generalist agent in 3d world

    Jiangyong Huang, Silong Yong, Xiaojian Ma, Xiongkun Linghu, Puhao Li, Yan Wang, Qing Li, Song-Chun Zhu, Baoxiong Jia, and Siyuan Huang. An embodied generalist agent in 3d world. In Proceedings of the International Conference on Machine Learning (ICML), 2024

  26. [26]

    Bc-z: Zero-shot task generalization with robotic imitation learning

    Eric Jang, Alex Irpan, Mohi Khansari, Daniel Kappler, Frederik Ebert, Corey Lynch, Sergey Levine, and Chelsea Finn. Bc-z: Zero-shot task generalization with robotic imitation learning. In Proceedings of the Conference on Robot Learning (CoRL), 2022

  27. [27]

    Qt-opt: Scalable deep reinforcement learning for vision-based robotic manipulation

    Dmitry Kalashnikov, Alex Irpan, Peter Pastor, Julian Ibarz, Alexander Herzog, Eric Jang, Deirdre Quillen, Ethan Holly, Mrinal Kalakrishnan, Vincent Vanhoucke, et al. Qt-opt: Scalable deep reinforcement learning for vision-based robotic manipulation. In Proceedings of the Conference on Robot Learning (CoRL), 2018

  28. [28]

    Prismatic vlms: Investigating the design space of visually-conditioned language models

    Siddharth Karamcheti, Suraj Nair, Ashwin Balakrishna, Percy Liang, Thomas Kollar, and Dorsa Sadigh. Prismatic vlms: Investigating the design space of visually-conditioned language models. In Proceedings of the International Conference on Machine Learning (ICML), 2024

  29. [29]

    DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset

    Alexander Khazatsky, Karl Pertsch, Suraj Nair, Ashwin Balakrishna, Sudeep Dasari, Siddharth Karamcheti, Soroush Nasiriany, Mohan Kumar Srirama, Lawrence Yunliang Chen, Kirsty Ellis, et al. Droid: A large-scale in-the-wild robot manipulation dataset. arXiv preprint arXiv:2403.12945, 2024

  30. [30]

    OpenVLA: An Open-Source Vision-Language-Action Model

    Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, et al. Openvla: An open-source vision-language-action model. arXiv preprint arXiv:2406.09246, 2024

  31. [31]

    CogACT: A Foundational Vision-Language-Action Model for Synergizing Cognition and Action in Robotic Manipulation

    Qixiu Li, Yaobo Liang, Zeyu Wang, Lin Luo, Xi Chen, Mozheng Liao, Fangyun Wei, Yu Deng, Sicheng Xu, Yizhong Zhang, et al. Cogact: A foundational vision-language-action model for synergizing cognition and action in robotic manipulation. arXiv preprint arXiv:2411.19650, 2024

  32. [33]

    Towards generalist robot policies: What matters in building vision-language-action models

    Xinghang Li, Peiyan Li, Minghuan Liu, Dong Wang, Jirong Liu, Bingyi Kang, Xiao Ma, Tao Kong, Hanbo Zhang, and Huaping Liu. Towards generalist robot policies: What matters in building vision-language-action models. arXiv preprint arXiv:2412.14058, 2024

  33. [34]

    Vision-language foundation models as effective robot imitators

    Xinghang Li, Minghuan Liu, Hanbo Zhang, Cunjun Yu, Jie Xu, Hongtao Wu, Chilam Cheang, Ya Jing, Weinan Zhang, Huaping Liu, et al. Vision-language foundation models as effective robot imitators. In Proceedings of International Conference on Learning Representations (ICLR), 2024

  34. [35]

    Evaluating real-world robot manipulation policies in simulation

    Xuanlin Li, Kyle Hsu, Jiayuan Gu, Karl Pertsch, Oier Mees, Homer Rich Walke, Chuyuan Fu, Ishikaa Lunawat, Isabel Sieh, Sean Kirmani, Sergey Levine, Jiajun Wu, Chelsea Finn, Hao Su, Quan Vuong, and Ted Xiao. Evaluating real-world robot manipulation policies in simulation. In Proceedings of the Conference on Robot Learning (CoRL), 2024

  35. [36]

    LIBERO: Benchmarking Knowledge Transfer for Lifelong Robot Learning

    Bo Liu, Yifeng Zhu, Chongkai Gao, Yihao Feng, Qiang Liu, Yuke Zhu, and Peter Stone. Libero: Benchmarking knowledge transfer for lifelong robot learning. arXiv preprint arXiv:2306.03310, 2023

  36. [37]

    Visual instruction tuning

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. In Proceedings of the Conference on Neural Information Processing Systems (NeurIPS), 2024

  37. [38]

    Robot learning on the job: Human-in-the-loop autonomy and learning during deployment

    Huihan Liu, Soroush Nasiriany, Lance Zhang, Zhiyao Bao, and Yuke Zhu. Robot learning on the job: Human-in-the-loop autonomy and learning during deployment. In Proceedings of Robotics: Science and Systems (RSS), 2023

  38. [39]

    RDT-1B: a Diffusion Foundation Model for Bimanual Manipulation

    Songming Liu, Lingxuan Wu, Bangguo Li, Hengkai Tan, Huayu Chen, Zhengyi Wang, Ke Xu, Hang Su, and Jun Zhu. Rdt-1b: a diffusion foundation model for bimanual manipulation. arXiv preprint arXiv:2410.07864, 2024

  39. [40]

    Visuo-spatial working memory

    Robert H Logie. Visuo-spatial working memory. Psychology Press, 2014

  40. [41]

    Multi-stage cable routing through hierarchical imitation learning

    Jianlan Luo, Charles Xu, Xinyang Geng, Gilbert Feng, Kuan Fang, Liam Tan, Stefan Schaal, and Sergey Levine. Multi-stage cable routing through hierarchical imitation learning. IEEE Transactions on Robotics, 40:1476–1491, 2024

  41. [42]

    Fmb: a functional manipulation benchmark for generalizable robotic learning

    Jianlan Luo, Charles Xu, Fangchen Liu, Liam Tan, Zipeng Lin, Jeffrey Wu, Pieter Abbeel, and Sergey Levine. Fmb: a functional manipulation benchmark for generalizable robotic learning. The International Journal of Robotics Research, 2024

  42. [43]

    Interactive language: Talking to robots in real time

    Corey Lynch, Ayzaan Wahid, Jonathan Tompson, Tianli Ding, James Betker, Robert Baruch, Travis Armstrong, and Pete Florence. Interactive language: Talking to robots in real time. IEEE Robotics and Automation Letters, 2023

  43. [44]

    Roboturk: A crowdsourcing platform for robotic skill learning through imitation

    Ajay Mandlekar, Yuke Zhu, Animesh Garg, Jonathan Booher, Max Spero, Albert Tung, Julian Gao, John Emmons, Anchit Gupta, Emre Orbay, et al. Roboturk: A crowdsourcing platform for robotic skill learning through imitation. In Proceedings of the Conference on Robot Learning (CoRL), 2018

  44. [45]

    Grounding language with visual affordances over unstructured data

    Oier Mees, Jessica Borja-Diaz, and Wolfram Burgard. Grounding language with visual affordances over unstructured data. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), 2023

  45. [46]

    Structured world models from human videos

    Russell Mendonca, Shikhar Bahl, and Deepak Pathak. Structured world models from human videos. In Proceedings of the Conference on Robot Learning (CoRL), 2023

  46. [47]

    Learning and retrieval from prior data for skill-based imitation learning

    Soroush Nasiriany, Tian Gao, Ajay Mandlekar, and Yuke Zhu. Learning and retrieval from prior data for skill-based imitation learning. In Proceedings of the Conference on Robot Learning (CoRL), 2023

  47. [48]

    Octo: An open-source generalist robot policy

    Octo Model Team, Dibya Ghosh, Homer Walke, Karl Pertsch, Kevin Black, Oier Mees, Sudeep Dasari, Joey Hejna, Charles Xu, Jianlan Luo, Tobias Kreiman, You Liang Tan, Lawrence Yunliang Chen, Pannag Sanketi, Quan Vuong, Ted Xiao, Dorsa Sadigh, Chelsea Finn, and Sergey Levine. Octo: An open-source generalist robot policy. In Proceedings of Robotics: Science and Systems (RSS), 2024

  48. [49]

    Actor-mimic: Deep multitask and transfer reinforcement learning

    Emilio Parisotto, Jimmy Lei Ba, and Ruslan Salakhutdinov. Actor-mimic: Deep multitask and transfer reinforcement learning. In Proceedings of International Conference on Learning Representations (ICLR), 2016

  49. [50]

    Kosmos-2: Grounding Multimodal Large Language Models to the World

    Zhiliang Peng, Wenhui Wang, Li Dong, Yaru Hao, Shaohan Huang, Shuming Ma, and Furu Wei. Kosmos-2: Grounding multimodal large language models to the world. arXiv preprint arXiv:2306.14824, 2023

  50. [51]

    FAST: Efficient Action Tokenization for Vision-Language-Action Models

    Karl Pertsch, Kyle Stachowicz, Brian Ichter, Danny Driess, Suraj Nair, Quan Vuong, Oier Mees, Chelsea Finn, and Sergey Levine. Fast: Efficient action tokenization for vision-language-action models. arXiv preprint arXiv:2501.09747, 2025

  51. [52]

    Child’s Conception of Space: Selected Works vol 4

    Jean Piaget. Child’s Conception of Space: Selected Works vol 4. Routledge, 2013

  52. [53]

    Livescene: Language embedding interactive radiance fields for physical scene rendering and control

    Delin Qu, Qizhi Chen, Pingrui Zhang, Xianqiang Gao, Junzhe Li, Bin Zhao, Dong Wang, and Xuelong Li. Livescene: Language embedding interactive radiance fields for physical scene rendering and control. arXiv preprint arXiv:2406.16038, 2024

  53. [54]

    Shared control templates for assistive robotics

    Gabriel Quere, Annette Hagengruber, Maged Iskandar, Samuel Bustamante, Daniel Leidner, Freek Stulp, and Jörn Vogel. Shared control templates for assistive robotics. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), 2020

  54. [55]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning (ICML), 2021

  55. [56]

    Latent plans for task-agnostic offline reinforcement learning

    Erick Rosete-Beas, Oier Mees, Gabriel Kalweit, Joschka Boedecker, and Wolfram Burgard. Latent plans for task-agnostic offline reinforcement learning. In Proceedings of the Conference on Robot Learning (CoRL), 2022

  56. [57]

    Policy distillation

    Andrei A Rusu, Sergio Gomez Colmenarejo, Caglar Gulcehre, Guillaume Desjardins, James Kirkpatrick, Razvan Pascanu, Volodymyr Mnih, Koray Kavukcuoglu, and Raia Hadsell. Policy distillation. In Proceedings of International Conference on Learning Representations (ICLR), 2016

  57. [58]

    Multi-resolution sensing for real-time control with vision-language models

    Saumya Saxena, Mohit Sharma, and Oliver Kroemer. Multi-resolution sensing for real-time control with vision-language models. In Proceedings of the Conference on Robot Learning (CoRL), 2023

  58. [59]

    On bringing robots home

    Nur Muhammad Mahi Shafiullah, Anant Rai, Haritheja Etukuru, Yiqian Liu, Ishan Misra, Soumith Chintala, and Lerrel Pinto. On bringing robots home. arXiv preprint arXiv:2311.16098, 2023

  59. [60]

    Mutex: Learning unified policies from multimodal task specifications

    Rutav Shah, Roberto Martín-Martín, and Yuke Zhu. Mutex: Learning unified policies from multimodal task specifications. In Proceedings of the Conference on Robot Learning (CoRL), 2023

  60. [61]

    Perceiver-actor: A multi-task transformer for robotic manipulation

    Mohit Shridhar, Lucas Manuelli, and Dieter Fox. Perceiver-actor: A multi-task transformer for robotic manipulation. In Proceedings of the Conference on Robot Learning (CoRL), 2022

  61. [62]

    Paligemma 2: A family of versatile vlms for transfer

    Andreas Steiner, André Susano Pinto, Michael Tschannen, Daniel Keysers, Xiao Wang, Yonatan Bitton, Alexey Gritsenko, Matthias Minderer, Anthony Sherbondy, Shangbang Long, et al. Paligemma 2: A family of versatile vlms for transfer. arXiv preprint arXiv:2412.03555, 2024

  62. [63]

    Cognitive maps in rats and men

    Edward C Tolman. Cognitive maps in rats and men. Psychological review, 55(4):189, 1948

  63. [64]

    Bridgedata v2: A dataset for robot learning at scale

    Homer Walke, Kevin Black, Abraham Lee, Moo Jin Kim, Max Du, Chongyi Zheng, Tony Zhao, Philippe Hansen-Estruch, Quan Vuong, Andre He, Vivek Myers, Kuan Fang, Chelsea Finn, and Sergey Levine. Bridgedata v2: A dataset for robot learning at scale. In Proceedings of the Conference on Robot Learning (CoRL), 2023

  64. [65]

    Scaling proprioceptive-visual learning with heterogeneous pre-trained transformers

    Lirui Wang, Xinlei Chen, Jialiang Zhao, and Kaiming He. Scaling proprioceptive-visual learning with heterogeneous pre-trained transformers. In Proceedings of the Conference on Neural Information Processing Systems (NeurIPS), 2024

  65. [66]

    ucsd kitchens dataset

    Ge Yan, Kris Wu, and Xiaolong Wang. ucsd kitchens dataset. https://github.com/geyan21/rlds_dataset_builder/tree/main/ucsd_kitchens, 2023

  66. [67]

    Thinking in space: How multimodal large language models see, remember, and recall spaces

    Jihan Yang, Shusheng Yang, Anjali W Gupta, Rilyn Han, Li Fei-Fei, and Saining Xie. Thinking in space: How multimodal large language models see, remember, and recall spaces. arXiv preprint arXiv:2412.14171, 2024

  67. [68]

    Sigmoid loss for language image pre-training

    Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023

  68. [69]

    3d-vla: A 3d vision-language-action generative world model

    Haoyu Zhen, Xiaowen Qiu, Peihao Chen, Jincheng Yang, Xin Yan, Yilun Du, Yining Hong, and Chuang Gan. 3d-vla: A 3d vision-language-action generative world model. In Proceedings of the International Conference on Machine Learning (ICML), 2024

  69. [70]

    Universal actions for enhanced embodied foundation models

    Jinliang Zheng, Jianxiong Li, Dongxiu Liu, Yinan Zheng, Zhihao Wang, Zhonghong Ou, Yu Liu, Jingjing Liu, Ya-Qin Zhang, and Xianyuan Zhan. Universal actions for enhanced embodied foundation models. arXiv preprint arXiv:2501.10105, 2025

  70. [71]

    Tracevla: Visual trace prompting enhances spatial-temporal awareness for generalist robotic policies

    Ruijie Zheng, Yongyuan Liang, Shuaiyi Huang, Jianfeng Gao, Hal Daumé III, Andrey Kolobov, Furong Huang, and Jianwei Yang. Tracevla: Visual trace prompting enhances spatial-temporal awareness for generalist robotic policies. arXiv preprint arXiv:2412.10345, 2024

  71. [72]

    Train offline, test online: A real robot learning benchmark

    Gaoyue Zhou, Victoria Dean, Mohan Kumar Srirama, Aravind Rajeswaran, Jyothish Pari, Kyle Hatch, Aryan Jain, Tianhe Yu, Pieter Abbeel, Lerrel Pinto, et al. Train offline, test online: A real robot learning benchmark. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), 2023

  72. [73]

    Llava-3d: A simple yet effective pathway to empowering lmms with 3d-awareness

    Chenming Zhu, Tai Wang, Wenwei Zhang, Jiangmiao Pang, and Xihui Liu. Llava-3d: A simple yet effective pathway to empowering lmms with 3d-awareness. arXiv preprint arXiv:2409.18125, 2024

  73. [74]

    Fanuc manipulation: A dataset for learning-based manipulation with fanuc mate 200id robot

    Xinghao Zhu, Ran Tian, Chenfeng Xu, Mingxiao Huo, Wei Zhan, Masayoshi Tomizuka, and Mingyu Ding. Fanuc manipulation: A dataset for learning-based manipulation with fanuc mate 200id robot. https://sites.google.com/berkeley.edu/fanuc-manipulation, 2023

  74. [75]

    Bottom-up skill discovery from unsegmented demonstrations for long-horizon robot manipulation

    Yifeng Zhu, Peter Stone, and Yuke Zhu. Bottom-up skill discovery from unsegmented demonstrations for long-horizon robot manipulation. IEEE Robotics and Automation Letters, 7(2):4126–4133, 2022

  75. [76]

    Learning generalizable manipulation policies with object-centric 3d representations

    Yifeng Zhu, Zhenyu Jiang, Peter Stone, and Yuke Zhu. Learning generalizable manipulation policies with object-centric 3d representations. In Proceedings of the Conference on Robot Learning (CoRL), 2023

  76. [77]

    Viola: Imitation learning for vision-based manipulation with object proposal priors

    Yifeng Zhu, Abhishek Joshi, Peter Stone, and Yuke Zhu. Viola: Imitation learning for vision-based manipulation with object proposal priors. In Proceedings of the Conference on Robot Learning (CoRL), 2023


Showing first 76 references.