pith. machine review for the scientific record.

arxiv: 2503.20020 · v1 · submitted 2025-03-25 · 💻 cs.RO

Recognition: no theorem link

Gemini Robotics: Bringing AI into the Physical World

Acorn Pooley, Adil Dostmohamed, Alexander Herzog, Alex X. Lee, Allan Zhou, Anirudha Majumdar, Annie Xie, Anthony Brohan, Antoine Laurens, Arunkumar Byravan, Ashwin Balakrishna, Assaf Hurwitz Michaely, Atil Iscen, Ayzaan Wahid, Brandon Hernaez, Carolina Parada, Charles Shu, Chase Kew, Chuyuan Fu, Claudio Fantacci, Cody Fong, Coline Devin, Danny Driess, David D'Ambrosio, Debidatta Dwibedi, Deepali Jain, Dhruv Shah, Dmitry Kalashnikov, Dorsa Sadigh, Emilio Parisotto, Erik Frey, Federico Casarini, Fei Xia, Francesco Nori, Gemini Robotics Team, Giulia Vezzani, Grace Vesom, Hao-Tien Lewis Chiang, Isabel Leal, Jacky Liang, Jake Varley, Jan Humplik, Jean-Baptiste Alayrac, Jerad Kirkland, Jie Tan, Jingwei Zhang, Jinyu Xie, Jonathan Tompson, Jose Enrique Chen, Joshua Ainslie, Jost Tobias Springenberg, Kanishka Rao, Kathryn Shea, Keerthana Gopalakrishnan, Ken Caluwaerts, Konstantinos Bousmalis, Krista Reymann, Krzysztof Choromanski, Laura Graesser, Leonard Hasenclever, Maria Bauza, Marissa Giustina, M. Emre Karagozler, Michael Elabd, Michael Neunert, Michiel Blokzijl, Mithun George Jacob, Mohit Sharma, Montserrat Gonzalez Arenas, Nicolas Heess, Norman Di Palo, Oriol Vinyals, Oscar Chang, Pannag Sanketi, Paul Wohlhart, Peng Xu, Peter Pastor, Pierre Sermanet, Rachel Sterneck, Radu Soricut, R. Alex Hofer, Razvan Surdulescu, Robert Baruch, Robert Moreno, Rui Yao, Ryan Julian, Saminda Abeyruwan, Sean Kirmani, Sergey Yaroshenko, Serkan Cabi, Sharath Maddineni, Sichun Xu, Stefani Karp, Stefano Saliceti, Stefan Welker, Steven Bohez, Sudeep Dasari, Sumeet Singh, Ted Xiao, Thomas Buschmann, Thomas Lampe, Tianli Ding, Tingnan Zhang, Todor Davchev, Travis Armstrong, Tsang-Wei Edward Lee, Vikas Sindhwani, Vincent Vanhoucke, Wenhao Yu, Wentao Yuan, Xi Chen, Yilun Du, Ying Xu, Yixin Lin, Yuheng Kuang, Yuxiang Yang, Yuxiang Zhou, Zhuo Xu


Pith reviewed 2026-05-11 15:24 UTC · model grok-4.3

classification 💻 cs.RO
keywords robotics · vision-language-action · multimodal models · embodied reasoning · robot manipulation · generalist AI · fine-tuning · safety

The pith

Gemini Robotics is a Vision-Language-Action model that directly controls robots to perform complex manipulation tasks in varied and unseen environments.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Gemini Robotics, a family of models built on Gemini 2.0 for physical agents. It claims the core VLA model executes smooth, reactive movements across many manipulation tasks, stays robust to object and position changes, works in new environments, and follows open-vocabulary instructions. Fine-tuning then lets it handle long-horizon dexterous work, learn short tasks from about 100 examples, and switch to new robot bodies. This rests on a companion embodied reasoning model that adds spatial and temporal understanding for tasks like object detection and trajectory prediction. The result aims to move general AI into real-world robot control.

Core claim

Gemini Robotics, an advanced Vision-Language-Action generalist model built upon Gemini 2.0, executes smooth and reactive movements to tackle a wide range of complex manipulation tasks while also being robust to variations in object types and positions, handling unseen environments as well as following diverse, open vocabulary instructions. With additional fine-tuning, it can be specialized to new capabilities including solving long-horizon, highly dexterous tasks, learning new short-horizon tasks from as few as 100 demonstrations and adapting to completely novel robot embodiments. This is made possible because it builds on the Gemini Robotics-ER model, which extends Gemini's multimodal reasoning capabilities into the physical world, with enhanced spatial and temporal understanding.

What carries the argument

The Vision-Language-Action (VLA) generalist model Gemini Robotics, which takes visual input and language instructions to output direct robot actions, supported by the Gemini Robotics-ER embodied reasoning model that adds spatial and temporal understanding for robotics tasks.

Load-bearing premise

That the model's robustness to variations and ability to generalize from limited demonstrations will hold when running on physical robots in uncontrolled real-world conditions.

What would settle it

Running the model on a physical robot in a new room with unfamiliar objects and instructions, then measuring whether task success rates match the reported levels.
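
One way to make this check concrete is to treat each trial as a success or failure and ask whether the reported success rate falls inside a confidence interval around the rate observed in the new room. The sketch below is an illustration only: the function names, the trial counts, and the reported rate passed in are hypothetical placeholders rather than figures from the paper, and the Wilson score interval is one common choice among several for small trial counts.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion.

    z = 1.96 corresponds to roughly 95% coverage.
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must lie between 0 and trials")

    p_hat = successes / trials
    denom = 1 + z**2 / trials
    center = (p_hat + z**2 / (2 * trials)) / denom
    half_width = (
        z * math.sqrt(p_hat * (1 - p_hat) / trials + z**2 / (4 * trials**2)) / denom
    )
    # Clamp to [0, 1] to absorb floating-point overshoot at the boundaries.
    return max(0.0, center - half_width), min(1.0, center + half_width)


def consistent_with_reported(
    successes: int, trials: int, reported_rate: float, z: float = 1.96
) -> bool:
    """True if the reported rate lies inside the interval for the observed trials."""
    low, high = wilson_interval(successes, trials, z)
    return low <= reported_rate <= high


if __name__ == "__main__":
    # Hypothetical numbers: 14 successes in 20 trials, compared with a
    # hypothetical reported success rate of 0.80.
    low, high = wilson_interval(14, 20)
    print(f"observed interval: [{low:.3f}, {high:.3f}]")
    print("consistent with reported rate:", consistent_with_reported(14, 20, 0.80))
```

With only a few dozen trials the interval is wide, so a result that is "consistent" with the reported level is weak evidence; a meaningful test would need enough trials per task for the interval to be narrow.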

Original abstract

Recent advancements in large multimodal models have led to the emergence of remarkable generalist capabilities in digital domains, yet their translation to physical agents such as robots remains a significant challenge. This report introduces a new family of AI models purposefully designed for robotics and built upon the foundation of Gemini 2.0. We present Gemini Robotics, an advanced Vision-Language-Action (VLA) generalist model capable of directly controlling robots. Gemini Robotics executes smooth and reactive movements to tackle a wide range of complex manipulation tasks while also being robust to variations in object types and positions, handling unseen environments as well as following diverse, open vocabulary instructions. We show that with additional fine-tuning, Gemini Robotics can be specialized to new capabilities including solving long-horizon, highly dexterous tasks, learning new short-horizon tasks from as few as 100 demonstrations and adapting to completely novel robot embodiments. This is made possible because Gemini Robotics builds on top of the Gemini Robotics-ER model, the second model we introduce in this work. Gemini Robotics-ER (Embodied Reasoning) extends Gemini's multimodal reasoning capabilities into the physical world, with enhanced spatial and temporal understanding. This enables capabilities relevant to robotics including object detection, pointing, trajectory and grasp prediction, as well as multi-view correspondence and 3D bounding box predictions. We show how this novel combination can support a variety of robotics applications. We also discuss and address important safety considerations related to this new class of robotics foundation models. The Gemini Robotics family marks a substantial step towards developing general-purpose robots that realizes AI's potential in the physical world.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces Gemini Robotics, a family of models built on Gemini 2.0 for robotics applications. It presents Gemini Robotics as a Vision-Language-Action (VLA) generalist model for direct robot control, claiming smooth reactive movements on complex manipulation tasks, robustness to object variations and positions, generalization to unseen environments, and adherence to open-vocabulary instructions. A second model, Gemini Robotics-ER, extends multimodal reasoning with spatial/temporal capabilities for tasks such as object detection, trajectory prediction, grasp prediction, and 3D bounding box estimation. The work further describes fine-tuning for long-horizon dexterous tasks, few-shot adaptation (as few as 100 demonstrations), embodiment transfer, and safety considerations.

Significance. If the performance and generalization claims hold under rigorous evaluation, the work would represent a meaningful advance in embodied AI by scaling large multimodal models to physical control. The explicit coupling of an embodied-reasoning module (ER) with action generation is a coherent architectural choice that could support more reliable long-horizon and open-vocabulary robot behavior. The emphasis on safety considerations for foundation-model robotics is also timely.

major comments (2)
  1. Abstract: The central claims of 'smooth and reactive movements,' 'robust[ness] to variations in object types and positions,' 'handling unseen environments,' and 'learning new short-horizon tasks from as few as 100 demonstrations' are presented without any quantitative metrics, success rates, benchmark comparisons (e.g., against RT-X, OpenVLA, or RT-2), or ablation results. These assertions are load-bearing for the paper's contribution yet remain unsupported in the provided text.
  2. Model and Capabilities sections: The description of Gemini Robotics-ER states that it enables 'object detection, pointing, trajectory and grasp prediction, as well as multi-view correspondence and 3D bounding box predictions,' but supplies no architectural details, training objectives, data sources, or evaluation protocols for these embodied-reasoning tasks. Without such information, it is impossible to assess whether the claimed spatial/temporal enhancements are realized or how they integrate with the VLA policy.
minor comments (2)
  1. The relationship between the base Gemini 2.0 model, Gemini Robotics-ER, and the final VLA policy should be illustrated with a diagram or explicit data-flow description to clarify which components are frozen versus fine-tuned.
  2. Safety considerations are mentioned but lack concrete examples of failure modes tested or mitigation strategies deployed during real-robot deployment.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed review. The comments highlight important opportunities to strengthen the presentation of quantitative evidence and model specifics. We address each point below and have revised the manuscript to incorporate the suggested improvements.

Point-by-point responses
  1. Referee: Abstract: The central claims of 'smooth and reactive movements,' 'robust[ness] to variations in object types and positions,' 'handling unseen environments,' and 'learning new short-horizon tasks from as few as 100 demonstrations' are presented without any quantitative metrics, success rates, benchmark comparisons (e.g., against RT-X, OpenVLA, or RT-2), or ablation results. These assertions are load-bearing for the paper's contribution yet remain unsupported in the provided text.

    Authors: We agree that the abstract would be strengthened by including key quantitative results. The full manuscript reports detailed evaluations, including success rates on complex manipulation tasks, robustness metrics across object variations, generalization performance in unseen environments, and few-shot adaptation results with as few as 100 demonstrations. Direct comparisons to baselines such as RT-X, OpenVLA, and RT-2, along with ablations, appear in the Experiments section. We have revised the abstract to highlight representative metrics (e.g., task success rates and adaptation efficiency) while retaining its concise nature and directing readers to the full results.

    revision: yes

  2. Referee: Model and Capabilities sections: The description of Gemini Robotics-ER states that it enables 'object detection, pointing, trajectory and grasp prediction, as well as multi-view correspondence and 3D bounding box predictions,' but supplies no architectural details, training objectives, data sources, or evaluation protocols for these embodied-reasoning tasks. Without such information, it is impossible to assess whether the claimed spatial/temporal enhancements are realized or how they integrate with the VLA policy.

    Authors: The manuscript contains a dedicated subsection on Gemini Robotics-ER that outlines the architectural extensions to Gemini 2.0 for spatial and temporal reasoning, the integration with the VLA policy for action generation, and the supported capabilities. Training uses a combination of supervised objectives on curated embodied datasets for tasks including trajectory and grasp prediction, with evaluations on standard protocols for object detection, 3D bounding boxes, and multi-view correspondence. To improve accessibility, we have expanded this section with additional specifics on the model architecture (e.g., module designs for spatial grounding), exact training objectives, primary data sources, and evaluation benchmarks, clarifying the coupling between the ER module and downstream control.

    revision: yes

Circularity Check

0 steps flagged

No significant circularity; purely descriptive model introduction

Full rationale

The paper is a technical report describing the Gemini Robotics family of models and their capabilities. It contains no equations, derivations, predictions, or mathematical claims that could reduce to inputs by construction. All statements are high-level descriptions of architecture, training, and observed behaviors, with no self-referential logic or fitted parameters presented as independent results. Self-citations (e.g., to Gemini 2.0) are standard background references and not load-bearing for any deductive chain. The work is self-contained as an engineering report against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The claims rely heavily on the untested assumption that the described fine-tuning and generalization properties will manifest in physical robot systems.

axioms (1)
  • domain assumption: Gemini 2.0's multimodal capabilities can be extended to embodied reasoning and robot action generation.
    This underpins the entire construction of the new models as described.

pith-pipeline@v0.9.0 · 6124 in / 1283 out tokens · 63860 ms · 2026-05-11T15:24:29.907364+00:00 · methodology


Forward citations

Cited by 45 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. RIO: Flexible Real-Time Robot I/O for Cross-Embodiment Robot Learning

    cs.RO 2026-05 unverdicted novelty 7.0

    RIO introduces a lightweight open-source framework that abstracts real-time robot I/O to support easy switching between embodiments and platforms for collecting data and deploying VLAs.

  2. Towards Backdoor-Based Ownership Verification for Vision-Language-Action Models

    cs.RO 2026-05 unverdicted novelty 7.0

    GuardVLA embeds a stealthy backdoor watermark in VLAs via secret messages in visual data and uses a swap-and-detect mechanism for post-release ownership verification that preserves task performance.

  3. NoiseGate: Learning Per-Latent Timestep Schedules as Information Gating in World Action Models

    cs.RO 2026-05 unverdicted novelty 7.0

    NoiseGate learns per-latent timestep schedules as an information-gating policy in diffusion-based world action models, yielding consistent gains on RoboTwin manipulation tasks.

  4. VLA-GSE: Boosting Parameter-Efficient Fine-Tuning in VLA with Generalized and Specialized Experts

    cs.RO 2026-05 unverdicted novelty 7.0

    VLA-GSE improves VLA adaptation by initializing generalized shared experts and specialized routed experts via spectral decomposition of the backbone, outperforming full fine-tuning and other PEFT methods on robotic be...

  5. Being-H0.7: A Latent World-Action Model from Egocentric Videos

    cs.RO 2026-04 unverdicted novelty 7.0

    Being-H0.7 adds future-aware latent reasoning to direct VLA policies via dual-branch alignment on latent queries, matching world-model benefits at VLA efficiency.

  6. ${\pi}_{0.7}$: a Steerable Generalist Robotic Foundation Model with Emergent Capabilities

    cs.LG 2026-04 unverdicted novelty 7.0

    π₀.₇ is a steerable generalist robotic model that uses rich multimodal prompts including language, subgoal images, and performance metadata to achieve out-of-the-box generalization across tasks and robot bodies.

  7. [Emerging Ideas] Artificial Tripartite Intelligence: A Bio-Inspired, Sensor-First Architecture for Physical AI

    cs.AI 2026-04 unverdicted novelty 7.0

    ATI is a tripartite bio-inspired architecture for physical AI that co-designs sensing and inference, shown in a camera prototype to raise accuracy from 53.8% to 88% and cut remote invocations by 43.3%.

  8. VP-VLA: Visual Prompting as an Interface for Vision-Language-Action Models

    cs.RO 2026-03 unverdicted novelty 7.0

    VP-VLA decouples high-level reasoning from low-level control in VLA models by rendering spatial anchors as visual prompts directly in the RGB observation space, outperforming end-to-end baselines.

  9. GuidedVLA: Specifying Task-Relevant Factors via Plug-and-Play Action Attention Specialization

    cs.RO 2026-05 unverdicted novelty 6.0

    GuidedVLA improves VLA success rates by manually supervising separate attention heads in the action decoder with auxiliary signals for task-relevant factors.

  10. Beyond Thinking: Imagining in 360$^\circ$ for Humanoid Visual Search

    cs.CV 2026-05 unverdicted novelty 6.0

    Imagining in 360° decouples visual search into a single-step probabilistic semantic layout predictor and an actor, removing the need for multi-turn CoT reasoning and trajectory annotations while improving efficiency i...

  11. One Token Per Frame: Reconsidering Visual Bandwidth in World Models for VLA Policy

    cs.CV 2026-05 unverdicted novelty 6.0

    Reducing visual input to one token per frame in world models for vision-language-action policies maintains long-horizon performance while improving success rates on MetaWorld, LIBERO, and real-robot tasks.

  12. One Token Per Frame: Reconsidering Visual Bandwidth in World Models for VLA Policy

    cs.CV 2026-05 unverdicted novelty 6.0

    Reducing visual input to one token per frame via adaptive attention pooling and a unified flow-matching objective improves long-horizon performance in VLA policies on MetaWorld, LIBERO, and real-robot tasks.

  13. Escaping the Diversity Trap in Robotic Manipulation via Anchor-Centric Adaptation

    cs.RO 2026-05 unverdicted novelty 6.0

    Anchor-Centric Adaptation escapes the diversity trap by prioritizing repeated demonstrations at core anchors over broad coverage, yielding higher success rates under fixed data budgets in robotic manipulation.

  14. Odysseus: Scaling VLMs to 100+ Turn Decision-Making in Games via Reinforcement Learning

    cs.LG 2026-05 unverdicted novelty 6.0

    Odysseus adapts PPO with a turn-level critic and leverages pretrained VLM action priors to train agents achieving at least 3x average game progress over frontier models in long-horizon Super Mario Land.

  15. Breaking Lock-In: Preserving Steerability under Low-Data VLA Post-Training

    cs.RO 2026-04 unverdicted novelty 6.0

    DeLock mitigates lock-in in low-data VLA post-training via visual grounding preservation and test-time contrastive prompt guidance, outperforming baselines across eight evaluations while matching data-heavy generalist...

  16. RL Token: Bootstrapping Online RL with Vision-Language-Action Models

    cs.LG 2026-04 unverdicted novelty 6.0

    RL Token enables sample-efficient online RL fine-tuning of large VLAs, delivering up to 3x speed gains and higher success rates on real-robot manipulation tasks within minutes to hours.

  17. GazeVLA: Learning Human Intention for Robotic Manipulation

    cs.RO 2026-04 unverdicted novelty 6.0

    GazeVLA pretrains on large human egocentric datasets to capture gaze-based intention, then finetunes on limited robot data with chain-of-thought reasoning to achieve better robotic manipulation performance than baselines.

  18. Learning-augmented robotic automation for real-world manufacturing

    cs.RO 2026-04 conditional novelty 6.0

    A learning-augmented robotic system automated deformable cable insertion and soldering on a live electric-motor production line for 5 hours 10 minutes, producing 108 motors at 99.4% pass rate with under 20 minutes of ...

  19. dWorldEval: Scalable Robotic Policy Evaluation via Discrete Diffusion World Model

    cs.RO 2026-04 unverdicted novelty 6.0

    A discrete diffusion model tokenizes multimodal robotic data and uses a progress token to predict future states and task completion for scalable policy evaluation.

  20. SafetyALFRED: Evaluating Safety-Conscious Planning of Multimodal Large Language Models

    cs.AI 2026-04 unverdicted novelty 6.0

    SafetyALFRED shows multimodal LLMs recognize kitchen hazards accurately in QA tests but achieve low success rates when required to mitigate those hazards through embodied planning.

  21. Chain Of Interaction Benchmark (COIN): When Reasoning meets Embodied Interaction

    cs.RO 2026-04 unverdicted novelty 6.0

    COIN provides 50 interactive robotic tasks, a 1000-demonstration dataset collected via AR teleoperation, and metrics showing that CodeAsPolicy, VLA, and H-VLA models fail at causally-dependent interactive reasoning du...

  22. Long-Term Memory for VLA-based Agents in Open-World Task Execution

    cs.RO 2026-04 unverdicted novelty 6.0

    ChemBot adds dual-layer memory and future-state asynchronous inference to VLA models, enabling better long-horizon success in chemical lab automation on collaborative robots.

  23. E-VLA: Event-Augmented Vision-Language-Action Model for Dark and Blurred Scenes

    cs.CV 2026-04 conditional novelty 6.0

    E-VLA integrates event streams directly into VLA models via lightweight fusion, raising Pick-Place success from 0% to 60-90% at 20 lux and from 0% to 20-25% under severe motion blur.

  24. FlashSAC: Fast and Stable Off-Policy Reinforcement Learning for High-Dimensional Robot Control

    cs.LG 2026-04 unverdicted novelty 6.0

    FlashSAC scales up Soft Actor-Critic with fewer updates, larger models, higher data throughput, and norm bounds to deliver faster, more stable training than PPO on high-dimensional robot control tasks across dozens of...

  25. Fast-WAM: Do World Action Models Need Test-time Future Imagination?

    cs.CV 2026-03 unverdicted novelty 6.0

    Fast-WAM shows that explicit future imagination at test time is not required for strong WAM performance; video modeling during training provides the main benefit.

  26. World Action Models are Zero-shot Policies

    cs.RO 2026-02 unverdicted novelty 6.0

    DreamZero uses a 14B video diffusion model as a World Action Model to achieve over 2x better zero-shot generalization on real robots than state-of-the-art VLAs, real-time 7Hz closed-loop control, and cross-embodiment ...

  27. InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

    cs.CV 2025-08 unverdicted novelty 6.0

    InternVL3.5 advances open-source multimodal models with Cascade RL for +16% reasoning gains and ViR for 4x inference speedup, with the 241B model reaching SOTA among open-source MLLMs on multimodal, reasoning, and age...

  28. GLM-4.5V and GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning

    cs.CV 2025-07 unverdicted novelty 6.0

    GLM-4.5V reaches state-of-the-art results on 42 multimodal benchmarks among open-source models of similar size by applying reinforcement learning with curriculum sampling to a strong vision foundation model.

  29. DexVLA: Vision-Language Model with Plug-In Diffusion Expert for General Robot Control

    cs.RO 2025-02 unverdicted novelty 6.0

    DexVLA combines a scaled diffusion action expert with embodiment curriculum learning to achieve better generalization and performance than prior VLA models on diverse robot hardware and long-horizon tasks.

  30. Nautilus: From One Prompt to Plug-and-Play Robot Learning

    cs.RO 2026-05 unverdicted novelty 5.0

    NAUTILUS is a prompt-driven harness that automates plug-and-play adapters, typed contracts, and validation for policies, benchmarks, and robots in learning research.

  31. Behavioral Mode Discovery for Fine-tuning Multimodal Generative Policies

    cs.LG 2026-05 unverdicted novelty 5.0

    Unsupervised behavioral mode discovery combined with mutual information rewards enables RL fine-tuning of multimodal generative policies that achieves higher success rates without losing action diversity.

  32. VLA-GSE: Boosting Parameter-Efficient Fine-Tuning in VLA with Generalized and Specialized Experts

    cs.RO 2026-05 unverdicted novelty 5.0

    VLA-GSE uses spectral decomposition of the VLA backbone to create generalized and specialized experts, enabling effective robot task adaptation while updating only 2.51% of parameters and achieving 81.2% zero-shot suc...

  33. From Where Things Are to What They Are For: Benchmarking Spatial-Functional Intelligence in Multimodal LLMs

    cs.CV 2026-05 unverdicted novelty 5.0

    SFI-Bench shows current multimodal LLMs struggle to integrate spatial memory with functional reasoning and external knowledge in video tasks.

  34. Context Unrolling in Omni Models

    cs.CV 2026-04 unverdicted novelty 5.0

    Omni is a multimodal model whose native training on diverse data types enables context unrolling, allowing explicit reasoning across modalities to better approximate shared knowledge and improve downstream performance.

  35. ALAS: Adaptive Long-Horizon Action Synthesis via Async-pathway Stream Disentanglement

    cs.RO 2026-04 unverdicted novelty 5.0

    ALAS disentangles environment and self-state streams via bio-inspired modules to deliver 23% higher subtask success and 29% better execution efficiency on long-horizon HSI tasks.

  36. Qwen3.5-Omni Technical Report

    cs.CL 2026-04 unverdicted novelty 5.0

    Qwen3.5-Omni scales an omnimodal model to hundreds of billions of parameters with 256k context, introduces ARIA for stable speech synthesis, and reports SOTA performance on 215 audio-visual benchmarks while adding mul...

  37. HiVLA: A Visual-Grounded-Centric Hierarchical Embodied Manipulation System

    cs.CV 2026-04 unverdicted novelty 5.0

    HiVLA decouples VLM-based semantic planning from DiT-based motor control via structured plans and cascaded cross-attention to outperform end-to-end VLA baselines in long-horizon and fine-grained manipulation.

  38. HiVLA: A Visual-Grounded-Centric Hierarchical Embodied Manipulation System

    cs.CV 2026-04 unverdicted novelty 5.0

    HiVLA decouples VLM-based semantic planning with visual grounding from a cascaded cross-attention DiT action expert, outperforming end-to-end VLAs on long-horizon and fine-grained manipulation.

  39. DA-PTQ: Drift-Aware Post-Training Quantization for Efficient Vision-Language-Action Models

    cs.RO 2026-04 unverdicted novelty 5.0

    DA-PTQ quantizes VLAs by compensating cross-space distortions and allocating mixed precision to minimize motion errors and kinematic drift in trajectories.

  40. The Cartesian Cut in Agentic AI

    cs.AI 2026-04 unverdicted novelty 5.0

    LLM agents use a Cartesian split between learned prediction and engineered control, enabling modularity but creating sensitivity and bottlenecks unlike integrated biological systems.

  41. OpenSpatial: A Principled Data Engine for Empowering Spatial Intelligence

    cs.CL 2026-04 unverdicted novelty 5.0

    OpenSpatial supplies a principled open-source data engine and 3-million-sample dataset that raises spatial-reasoning model performance by an average of 19 percent on benchmarks.

  42. RLDX-1 Technical Report

    cs.RO 2026-05 unverdicted novelty 4.0

    RLDX-1 outperforms frontier VLAs such as π0.5 and GR00T N1.6 on dexterous manipulation benchmarks, reaching 86.8% success on ALLEX humanoid tasks versus around 40% for the baselines.

  43. RLDX-1 Technical Report

    cs.RO 2026-05 unverdicted novelty 4.0

    RLDX-1 achieves 86.8% success on complex ALLEX humanoid manipulation tasks where prior VLAs reach only around 40%.

  44. ARIS: Agentic and Relationship Intelligence System for Social Robots

    cs.RO 2026-05 unverdicted novelty 4.0

    ARIS integrates a graph-based Social World Model, RAG, and agentic architecture for social robots and reports higher user ratings for intelligence, animacy, anthropomorphism, and likeability than an LLM baseline in a ...

  45. Seed1.5-VL Technical Report

    cs.CV 2025-05 unverdicted novelty 4.0

    Seed1.5-VL is a compact multimodal model that sets new records on dozens of vision-language benchmarks and outperforms prior systems on agent-style tasks.

Reference graph

Works this paper leans on

34 extracted references · 34 canonical work pages · cited by 41 Pith papers · 1 internal anchor

  1. [1]

    URL https://openreview.net/forum?id=T8AiZj1QdN. Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, Carol Chen, Catherine Olsson, Christopher Olah, Danny Hernandez, Dawn Drain, Deep Ganguli, Dustin Li, Eli Tran-Johnson, Ethan Perez, Jamie Kerr, Jared Mueller,...

  2. [2]

    Jeff Dean

    URL https://proceedings.mlr.press/v270/xu25b.html. Jeff Dean. Introducing Pathways: A next-generation AI architecture, 2021. URLhttps://blog. google/technology/ai/introducing-pathways-next-generation-ai-architecture/. MattDeitke,ChristopherClark,SanghoLee,RohunTripathi,YueYang,JaeSungPark,Mohammadreza Salehi, Niklas Muennighoff, Kyle Lo, Luca Soldaini, et...

  3. [3]

    2021 , url =

    URL https://arxiv.org/abs/2405.02292. UMI-Data. UMI-Data, 2024. URLhttps://umi-data.github.io/. Jake Varley, Sumeet Singh, Deepali Jain, Krzysztof Choromanski, Andy Zeng, Somnath Basu Roy Chowdhury, Avinava Dubey, and Vikas Sindhwani. Embodied AI with two arms: Zero-shot learning, safety and modularity. InIROS, pages 3651–3657. IEEE, 2024. ISBN 979-8-3503...

  4. [4]

    in_frame

    Contributions and Acknowledgments Authors Saminda Abeyruwan Joshua Ainslie Jean-Baptiste Alayrac Montserrat Gonzalez Arenas Travis Armstrong Ashwin Balakrishna Robert Baruch Maria Bauza Michiel Blokzijl Steven Bohez Konstantinos Bousmalis Anthony Brohan Thomas Buschmann Arunkumar Byravan Serkan Cabi Ken Caluwaerts Federico Casarini Oscar Chang Jose Enriqu...

  5. [5]

    The user will provide a task instruction along with an initial image of the workspace area from the overhead camera, initial robot state and initial scene objects

    Receive instruction. The user will provide a task instruction along with an initial image of the workspace area from the overhead camera, initial robot state and initial scene objects

  6. [6]

    Mention where the objects are located on the table

    Describe the scene. Mention where the objects are located on the table

  7. [7]

    Think about the best approach to execute the task provided the object locations, objectdimensions, robotembodimentconstraintsanddirectionguidelinesprovidedbelow

    StepsPlanning. Think about the best approach to execute the task provided the object locations, objectdimensions, robotembodimentconstraintsanddirectionguidelinesprovidedbelow. Write down all of the steps you need to follow in detail to execute the task successfully with the robot. Each step should be as concise as possible and should contain a descriptio...

  8. [8]

    After enumerating all the steps, write python code to execute each step for one step at a time on the robot using the API provided above

    Steps Execution. After enumerating all the steps, write python code to execute each step for one step at a time on the robot using the API provided above. For each step:

  9. [9]

    Rewrite a summary of the goal for the given step

  10. [10]

    When grasping an object, follow the grasping guidelines provided below

  11. [11]

    Describe your thought process

    When moving a gripper to a specific position and orientation, make sure the target position is reachable according to the robot physical constraints described below and that there is enough clearance between other objects (including other gripper arms) to avoid collisions. Describe your thought process

  12. [12]

    Write code to execute the given step on the robot using the api, this includes writing code to compute cartesian trajectories

  13. [13]

    Summarize what the robot did as it executed the code based on the new image, robot state and initial scene objects as well as any execution error or user feedback

    The code will be executed and you will be provided with a new image, the status of the execution and any error information that might have resulted from the code execution including anything printed to I/O. Summarize what the robot did as it executed the code based on the new image, robot state and initial scene objects as well as any execution error or u...

  14. [14]

    If they align, continue with writing code

    Compare your summary of what the robot did during code execution with the objective for that particular step. If they align, continue with writing code. If not, re-plan and write new steps to execute the task successfully. Consider the current state of the system when replanning (e.g., if a grasp failed the grippers may need to be reopened before attempti...

  15. [15]

    In the world frame, front/back is along the y axis, left/right is along the x axis and up/down is along the z axis with following directions: Positive x: Towards the right

    Repeat steps 4.1-4.6 until you have completed all steps successfully. In the world frame, front/back is along the y axis, left/right is along the x axis and up/down is along the z axis with following directions: Positive x: Towards the right. Negative x: Towards the left. Positive y: Towards front of the t...

  16. [16]

    Gripper has two parallel 0.09m fingers that can open up to 0.065m

  17. [17]

    The center of the table belongs to the (0, 0, 0) coordinate in world frame

    The table area is 0.80 meters wide (from left to right) and 0.40 meters long (from front to back). The center of the table belongs to the (0, 0, 0) coordinate in world frame

  18. [18]

    The left arm can only reach the left side of the table which belongs to x coordinates greater than -0.40 meters but less than 0.1 meters

  19. [19]

    Grasp Guidelines:

    The right arm can only reach the right side of the table which belongs to x coordinates greater than -0.1 meters but less than 0.40 meters. Grasp Guidelines:

  20. [20]

    This grasp pose must be used to compute a pre-grasp pose

    Always use the get_grasp_position_and_euler_orientation function to get the grasp position and euler orientation for a specific object and gripper. This grasp pose must be used to compute a pre-grasp pose

  21. [21]

    If the arms are blocking the object, move the arms out of the way before attempting the grasp

    Clear visibility: Make sure the robot arms are not blocking the visibility of the object. If the arms are blocking the object, move the arms out of the way before attempting the grasp

  22. [22]

    Reachability: Ensuring the gripper can reach the desired grasp points on the object given its arm length and workspace limits

  23. [23]

    Make sure the gripper is open before going to the grasp pose

  24. [24]

    left_gripper

    Successful grasp: A successful grasp will be reflected in the distance_between_fingers state of the robot. After closing the gripper the value of distance_between_fingers should be greater than 0 if the grippers are successfully enclosing the object. Robot API Interface Documentation: class Gripper(enum.Enum): LEFT = "left_gripper" RIGHT = "right_gripper" cl...

  25. [25]

    **Open both grippers:** Ensure both grippers are open to prepare for grasping

  26. [26]

    * Calculate a pre-grasp position slightly above the banana

    **Grasp the banana with the right gripper:** * Detect the banana's position and orientation. * Calculate a pre-grasp position slightly above the banana. * Move the right gripper to the pre-grasp position. * Move the right gripper to the grasp position. * Close the right gripper

  27. [27]

    This position should be clear of the bowl

    **Move the banana to a handover position:** If the right arm cannot reach the bowl, move the banana to a handover position closer to the left arm. This position should be clear of the bowl

  28. [28]

    **Open the right gripper:** Release the banana at the handover position

  29. [29]

    **Move the right gripper out of the way:** Move the right gripper to a safe position to allow the left gripper to grasp the banana

  30. [30]

    * Calculate a pre-grasp position slightly above the banana

    **Grasp the banana with the left gripper:** * Detect the banana's position and orientation. * Calculate a pre-grasp position slightly above the banana. * Move the left gripper to the pre-grasp position. * Move the left gripper to the grasp position. * Close the left gripper

  31. [31]

    * Move the left gripper to a position above the bowl

    **Move the banana over the bowl:** * Detect the bowl's position. * Move the left gripper to a position above the bowl

  32. [32]

    **Release the banana into the bowl:** Open the left gripper to drop the banana into the bowl

  33. [33]

    right_gripper

    **Move the left gripper to a safe position:** Move the left gripper to a safe position. Figure 32 | Example of planning by Gemini whilst carrying out a robot control task. **Step 2: Grasp banana with left gripper** * **Goal:** Move the left gripper to the banana's grasp position and close the gripper. ...

  34. [34]

    Put the top left green grapes into the right compartment of the grey box

    adding typos, 2) translating the instruction to a different language (Spanish), 3) rephrasing the instruction, and 4) adding descriptive modifiers. See Figure 37 for detailed examples. Put the top left green grapes into the right compartment of the grey box. Put the top lft gren grapes into the rht comprtment of the grey bx. Coloque las uvas verdes de la ...
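
The workspace and grasp rules quoted in items 16 to 24 above can be restated as a few simple checks. The following is a minimal sketch, not the paper's robot API: only the `Gripper` enum values and the numeric limits come from the quoted prompt, while `ARM_X_RANGE` and the helper functions `on_table`, `arms_that_can_reach`, `needs_handover` and `grasp_succeeded` are hypothetical names introduced here for illustration. The handover check mirrors the banana plan above, where the object is passed between arms when a single arm cannot reach both the object and the target.

```python
import enum
from typing import List


class Gripper(enum.Enum):
    # Values as quoted in the Robot API Interface Documentation excerpt.
    LEFT = "left_gripper"
    RIGHT = "right_gripper"


# Numbers quoted in the prompt (metres, world frame, origin at the table centre).
TABLE_WIDTH_M = 0.80   # left to right, along x
TABLE_LENGTH_M = 0.40  # front to back, along y

# Hypothetical constant: open x intervals each arm can reach, per the prompt text.
ARM_X_RANGE = {
    Gripper.LEFT: (-0.40, 0.10),
    Gripper.RIGHT: (-0.10, 0.40),
}


def on_table(x: float, y: float) -> bool:
    """Return True if (x, y) lies within the table area centred at the origin."""
    return abs(x) <= TABLE_WIDTH_M / 2 and abs(y) <= TABLE_LENGTH_M / 2


def arms_that_can_reach(x: float) -> List[Gripper]:
    """Return the grippers whose reachable x interval strictly contains x."""
    return [g for g in Gripper if ARM_X_RANGE[g][0] < x < ARM_X_RANGE[g][1]]


def needs_handover(object_x: float, target_x: float) -> bool:
    """Return True if no single arm can reach both the object and the target."""
    shared = set(arms_that_can_reach(object_x)) & set(arms_that_can_reach(target_x))
    return not shared


def grasp_succeeded(distance_between_fingers: float) -> bool:
    """After closing, a finger distance above 0 suggests an object is enclosed."""
    return distance_between_fingers > 0
```

Under these ranges, `needs_handover(0.3, -0.3)` is true because only the right arm reaches x = 0.3 and only the left arm reaches x = -0.3. Any point with x strictly between -0.1 and 0.1 is reachable by both arms and could serve as a handover position.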