pith. machine review for the scientific record.

arxiv: 2510.03342 · v3 · submitted 2025-10-02 · 💻 cs.RO

Recognition: 2 theorem links · Lean Theorem

Gemini Robotics 1.5: Pushing the Frontier of Generalist Robots with Advanced Embodied Reasoning, Thinking, and Motion Transfer

Gemini Robotics Team , Abbas Abdolmaleki , Saminda Abeyruwan , Joshua Ainslie , Jean-Baptiste Alayrac , Montserrat Gonzalez Arenas , Ashwin Balakrishna , Nathan Batchelor
Alex Bewley, Jeff Bingham, Michael Bloesch, Konstantinos Bousmalis, Philemon Brakel, Anthony Brohan, Thomas Buschmann, Arunkumar Byravan, Serkan Cabi, Ken Caluwaerts, Federico Casarini, Christine Chan, Oscar Chang, London Chappellet-Volpini, Jose Enrique Chen, Xi Chen, Hao-Tien Lewis Chiang, Krzysztof Choromanski, Adrian Collister, David B. D'Ambrosio, Sudeep Dasari, Todor Davchev, Meet Kirankumar Dave, Coline Devin, Norman Di Palo, Tianli Ding, Carl Doersch, Adil Dostmohamed, Yilun Du, Debidatta Dwibedi, Sathish Thoppay Egambaram, Michael Elabd, Tom Erez, Xiaolin Fang, Claudio Fantacci, Cody Fong, Erik Frey, Chuyuan Fu, Ruiqi Gao, Marissa Giustina, Keerthana Gopalakrishnan, Laura Graesser, Oliver Groth, Agrim Gupta, Roland Hafner, Steven Hansen, Leonard Hasenclever, Sam Haves, Nicolas Heess, Brandon Hernaez, Alex Hofer, Jasmine Hsu, Lu Huang, Sandy H. Huang, Atil Iscen, Mithun George Jacob, Deepali Jain, Sally Jesmonth, Abhishek Jindal, Ryan Julian, Dmitry Kalashnikov, M. Emre Karagozler, Stefani Karp, Matija Kecman, J. Chase Kew, Donnie Kim, Frank Kim, Junkyung Kim, Thomas Kipf, Sean Kirmani, Ksenia Konyushkova, Li Yang Ku, Yuheng Kuang, Thomas Lampe, Antoine Laurens, Tuan Anh Le, Isabel Leal, Alex X. Lee, Tsang-Wei Edward Lee, Guy Lever, Jacky Liang, Li-Heng Lin, Fangchen Liu, Shangbang Long, Caden Lu, Sharath Maddineni, Anirudha Majumdar, Kevis-Kokitsi Maninis, Andrew Marmon, Sergio Martinez, Assaf Hurwitz Michaely, Niko Milonopoulos, Joss Moore, Robert Moreno, Michael Neunert, Francesco Nori, Joy Ortiz, Kenneth Oslund, Carolina Parada, Emilio Parisotto, Amaris Paryag, Acorn Pooley, Thomas Power, Alessio Quaglino, Haroon Qureshi, Rajkumar Vasudeva Raju, Helen Ran, Dushyant Rao, Kanishka Rao, Isaac Reid, David Rendleman, Krista Reymann, Miguel Rivas, Francesco Romano, Yulia Rubanova, Peter Pastor Sampedro, Pannag R Sanketi, Dhruv Shah, Mohit Sharma, Kathryn Shea, Mohit Shridhar, Charles Shu, Vikas Sindhwani, Sumeet Singh, Radu Soricut, Rachel Sterneck, Ian Storz, Razvan Surdulescu, Jie Tan, Jonathan Tompson, Saran Tunyasuvunakool, Jake Varley, Grace Vesom, Giulia Vezzani, Maria Bauza Villalonga, Oriol Vinyals, René Wagner, Ayzaan Wahid, Stefan Welker, Paul Wohlhart, Chengda Wu, Markus Wulfmeier, Fei Xia, Ted Xiao, Annie Xie, Jinyu Xie, Peng Xu, Sichun Xu, Ying Xu, Zhuo Xu, Jimmy Yan, Sherry Yang, Skye Yang, Yuxiang Yang, Hiu Hong Yu, Wenhao Yu, Wentao Yuan, Yuan Yuan, Jingwei Zhang, Tingnan Zhang, Zhiyuan Zhang, Allan Zhou, Guangyao Zhou, Yuxiang Zhou
Authors on Pith: no claims yet

Pith reviewed 2026-05-16 07:33 UTC · model grok-4.3

classification 💻 cs.RO
keywords Vision-Language-Action · Embodied Reasoning · Motion Transfer · Multi-embodiment learning · Generalist robots · Task planning · Physical agents

The pith

Gemini Robotics 1.5 adds motion transfer and interleaved language reasoning to let multi-embodiment robots handle complex physical tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Gemini Robotics 1.5 as a Vision-Language-Action model that uses a Motion Transfer mechanism to train on data from many different robot bodies at once. It also interleaves every action sequence with multi-level natural language reasoning so the model plans steps internally before moving. A companion Gemini Robotics-ER 1.5 model reaches new state-of-the-art results on embodied reasoning benchmarks for spatial understanding, task planning, and progress tracking. These pieces together aim to produce robots that perceive their surroundings, reason about goals, and execute multi-step actions more reliably than prior systems.

Core claim

A novel architecture equipped with a Motion Transfer mechanism lets the VLA model absorb heterogeneous data from multiple robot embodiments, while interleaving actions with internal natural-language reasoning steps improves decomposition of complex tasks and produces more interpretable behavior; the separate ER model then sets new performance records on the specific reasoning skills required for physical interaction.

What carries the argument

Motion Transfer (MT) mechanism that transfers learned motion patterns across different robot embodiments, combined with multi-level internal reasoning expressed in natural language before each action.
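The interleaved "think before acting" loop described above can be sketched as a simple agent loop. This is a minimal illustration, not the paper's actual architecture or API; all names here (`run_task`, `observe`, `act`, `Trace`) are hypothetical stand-ins.

```python
# Illustrative sketch of a "think before acting" control loop:
# the agent emits readable natural-language reasoning before each
# low-level action. All names are hypothetical, not the paper's API.

from dataclasses import dataclass, field

@dataclass
class Trace:
    """Interleaved record of reasoning and actions."""
    entries: list = field(default_factory=list)

    def log(self, kind, text):
        self.entries.append((kind, text))

def run_task(goal, observe, act, max_steps=10):
    """Alternate natural-language reasoning with low-level actions."""
    trace = Trace()
    trace.log("plan", f"Decompose goal: {goal}")
    for step in range(max_steps):
        obs = observe()
        # High-level reasoning is expressed in readable language first...
        thought = f"step {step}: given {obs}, choose next subtask"
        trace.log("think", thought)
        # ...then the action is predicted conditioned on that reasoning.
        action = act(obs, thought)
        trace.log("act", str(action))
        if action == "done":
            break
    return trace
```

Because the trace interleaves `think` and `act` entries, the robot's behavior stays inspectable step by step, which is the interpretability benefit the paper claims for this design.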

If this is right

  • Robots become able to break down and carry out longer sequences of actions without hand-crafted scripts.
  • Behavior becomes more transparent because the internal reasoning chain is expressed in readable language.
  • A single model can be deployed on robots with different physical forms after training on mixed data.
  • Embodied reasoning benchmarks improve on visual grounding, spatial relations, and step-by-step planning.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same motion-transfer approach could shorten the time needed to adapt the model to entirely new hardware platforms.
  • Visible language reasoning opens the possibility of real-time human correction during execution.
  • If the reasoning layer generalizes, similar interleaving might improve other embodied agents such as autonomous vehicles or manipulators in warehouses.

Load-bearing premise

Benchmark gains from motion transfer and interleaved reasoning will carry over to unstructured real-world settings containing objects, lighting, and dynamics absent from training data.

What would settle it

Place the robot in a previously unseen room with novel objects and changed lighting, then measure whether it still completes the same multi-step tasks it succeeded on in controlled benchmarks.
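The settling experiment above amounts to comparing success rates on the same tasks under seen and unseen conditions. A hypothetical evaluation harness might look like this; `run_episode` and the condition names are illustrative stand-ins for a real robot or simulator rollout.

```python
# Hypothetical protocol for the generalization test described above:
# run the same multi-step tasks in the benchmark room and in a
# previously unseen room, then compare success rates. `run_episode`
# is a stand-in that returns True when a trial completes the task.

def success_rate(run_episode, tasks, condition, trials=20):
    """Fraction of trials that complete each task under one condition."""
    successes = sum(
        run_episode(task, condition) for task in tasks for _ in range(trials)
    )
    return successes / (len(tasks) * trials)

def generalization_gap(run_episode, tasks):
    """Seen-minus-unseen success rate; small gaps support transfer."""
    seen = success_rate(run_episode, tasks, condition="benchmark_room")
    unseen = success_rate(run_episode, tasks, condition="novel_room")
    return seen - unseen
```

A gap near zero would support the load-bearing premise; a large gap would indicate benchmark gains that do not survive novel objects, lighting, and dynamics.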

read the original abstract

General-purpose robots need a deep understanding of the physical world, advanced reasoning, and general and dexterous control. This report introduces the latest generation of the Gemini Robotics model family: Gemini Robotics 1.5, a multi-embodiment Vision-Language-Action (VLA) model, and Gemini Robotics-ER 1.5, a state-of-the-art Embodied Reasoning (ER) model. We are bringing together three major innovations. First, Gemini Robotics 1.5 features a novel architecture and a Motion Transfer (MT) mechanism, which enables it to learn from heterogeneous, multi-embodiment robot data and makes the VLA more general. Second, Gemini Robotics 1.5 interleaves actions with a multi-level internal reasoning process in natural language. This enables the robot to "think before acting" and notably improves its ability to decompose and execute complex, multi-step tasks, and also makes the robot's behavior more interpretable to the user. Third, Gemini Robotics-ER 1.5 establishes a new state-of-the-art for embodied reasoning, i.e., for reasoning capabilities that are critical for robots, such as visual and spatial understanding, task planning, and progress estimation. Together, this family of models takes us a step towards an era of physical agents: enabling robots to perceive, think and then act so they can solve complex multi-step tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces Gemini Robotics 1.5, a multi-embodiment Vision-Language-Action (VLA) model with a novel architecture and Motion Transfer (MT) mechanism designed to learn from heterogeneous robot data, along with interleaved multi-level natural language reasoning to enable 'thinking before acting' for complex tasks. It also presents Gemini Robotics-ER 1.5 as achieving state-of-the-art performance on embodied reasoning benchmarks covering visual/spatial understanding, task planning, and progress estimation. The overall goal is advancing generalist physical agents capable of perception, reasoning, and dexterous control.

Significance. If the claimed generalization benefits from MT and the interleaved reasoning hold under rigorous testing, the work would mark a meaningful advance in multi-embodiment VLAs by addressing embodiment-specific data heterogeneity. The emphasis on interpretable internal reasoning is a positive direction for robot transparency. However, the absence of any quantitative metrics, ablation studies, or cross-embodiment transfer results in the provided text leaves the central performance claims unverified and limits assessment of whether MT genuinely enables embodiment-agnostic representations beyond what larger data or model scale would achieve.

major comments (2)
  1. [Abstract] Abstract: The central claim that the Motion Transfer (MT) mechanism 'enables it to learn from heterogeneous, multi-embodiment robot data and makes the VLA more general' is not supported by any ablation studies isolating MT's contribution, cross-embodiment transfer metrics (e.g., success rates when training on one embodiment and testing on another), or description of the latent alignment procedure. Without these, benchmark gains cannot be confidently attributed to MT rather than data volume or architecture scale.
  2. [Abstract] Abstract: The assertion that Gemini Robotics-ER 1.5 'establishes a new state-of-the-art for embodied reasoning' and that the overall family 'takes us a step towards an era of physical agents' is presented without any quantitative results, baseline comparisons, or evaluation protocols. This renders the performance and generalization claims unverifiable from the manuscript as presented.
minor comments (1)
  1. [Abstract] The abstract uses several forward-looking phrases ('pushing the frontier', 'era of physical agents') that could be toned down to focus strictly on the technical contributions and measured results.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their careful reading and constructive comments on the abstract. We address each point below and will revise the manuscript to strengthen the presentation of our claims.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The central claim that the Motion Transfer (MT) mechanism 'enables it to learn from heterogeneous, multi-embodiment robot data and makes the VLA more general' is not supported by any ablation studies isolating MT's contribution, cross-embodiment transfer metrics (e.g., success rates when training on one embodiment and testing on another), or description of the latent alignment procedure. Without these, benchmark gains cannot be confidently attributed to MT rather than data volume or architecture scale.

    Authors: We agree that isolating the contribution of MT requires explicit ablations and cross-embodiment transfer results. The full manuscript describes the MT architecture and latent alignment procedure in detail and provides qualitative demonstrations of multi-embodiment learning. However, we acknowledge the absence of quantitative ablations and transfer metrics in the current version. We will add a dedicated ablation study section reporting success rates for training on one embodiment and evaluating on others, along with comparisons to scale-only baselines. revision: yes

  2. Referee: [Abstract] Abstract: The assertion that Gemini Robotics-ER 1.5 'establishes a new state-of-the-art for embodied reasoning' and that the overall family 'takes us a step towards an era of physical agents' is presented without any quantitative results, baseline comparisons, or evaluation protocols. This renders the performance and generalization claims unverifiable from the manuscript as presented.

    Authors: The abstract summarizes results that are quantified in the main body, where Gemini Robotics-ER 1.5 is evaluated on embodied reasoning benchmarks with direct baseline comparisons and described evaluation protocols. We will revise the abstract to include specific quantitative improvements (e.g., accuracy deltas on visual/spatial, planning, and progress estimation tasks) and a brief reference to the evaluation section so that the SOTA claim is verifiable from the abstract alone. revision: yes
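The cross-embodiment ablation promised in the first response can be summarized as a transfer matrix of success rates, trained on one embodiment and evaluated on another. This is a sketch of that bookkeeping only; `evaluate` and the embodiment names are hypothetical, not from the paper.

```python
# Sketch of the promised cross-embodiment ablation: a matrix of
# success rates for training on one embodiment and evaluating on
# another. `evaluate(train, test)` is a hypothetical callable
# returning a success rate in [0, 1].

def transfer_matrix(evaluate, embodiments):
    """matrix[train][test] = success rate of a policy trained on
    `train` data when evaluated on the `test` embodiment."""
    return {
        train: {test: evaluate(train, test) for test in embodiments}
        for train in embodiments
    }

def transfer_gain(matrix, train, test):
    """Cross-embodiment success minus in-domain success on `test`;
    values near zero would suggest embodiment-agnostic transfer."""
    return matrix[train][test] - matrix[test][test]
```

Reporting the off-diagonal entries alongside scale-only baselines is what would let readers attribute gains to MT rather than to data volume.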

Circularity Check

0 steps flagged

No significant circularity in the derivation chain

full rationale

The paper introduces Gemini Robotics 1.5 as a multi-embodiment VLA model featuring a Motion Transfer mechanism and interleaved natural-language reasoning, plus a separate Embodied Reasoning model. All central claims are supported by descriptions of training procedures and empirical benchmark results rather than mathematical derivations, equations, or self-referential definitions. No steps reduce predictions or uniqueness claims to fitted inputs or prior self-citations by construction; the argument chain remains self-contained through external data and evaluation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical large-model report. Central claims rest on standard deep learning assumptions about generalization from large-scale training data and the effectiveness of the described architecture; no explicit free parameters, axioms, or invented entities are stated in the abstract.

pith-pipeline@v0.9.0 · 6363 in / 1073 out tokens · 32601 ms · 2026-05-16T07:33:02.554670+00:00 · methodology

discussion (0)


Forward citations

Cited by 22 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Open-H-Embodiment: A Large-Scale Dataset for Enabling Foundation Models in Medical Robotics

    cs.RO 2026-04 conditional novelty 8.0

    Open-H-Embodiment is the largest open multi-embodiment medical robotics dataset, used to train GR00T-H, the first open vision-language-action model that achieves end-to-end suturing completion where prior models fail.

  2. Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding

    cs.CV 2026-01 unverdicted novelty 8.0

    Molmo2 delivers state-of-the-art open-weight video VLMs with new grounding datasets and training methods that outperform prior open models and match or exceed some proprietary ones on pointing and tracking tasks.

  3. Offline Policy Evaluation for Manipulation Policies via Discounted Liveness Formulation

    cs.RO 2026-05 conditional novelty 7.0

    A liveness-based Bellman operator enables conservative offline policy evaluation for manipulation tasks by encoding task progression and reducing truncation bias from finite horizons.

  4. MemCompiler: Compile, Don't Inject -- State-Conditioned Memory for Embodied Agents

    cs.RO 2026-05 unverdicted novelty 7.0

    MemCompiler reframes memory use as state-conditioned compilation, delivering relevant guidance via text and latent channels to improve embodied agent performance up to 129% and cut latency 60% versus static injection.

  5. MemCompiler: Compile, Don't Inject -- State-Conditioned Memory for Embodied Agents

    cs.RO 2026-05 unverdicted novelty 7.0

    MemCompiler introduces state-conditioned memory compilation that dynamically selects and compiles relevant memory into text and latent guidance, yielding up to 129% gains over no-memory baselines and 60% lower latency...

  6. MolmoAct2: Action Reasoning Models for Real-world Deployment

    cs.RO 2026-05 unverdicted novelty 7.0

    MolmoAct2 delivers an open VLA model with new specialized components, datasets, and techniques that outperforms baselines on benchmarks while releasing all weights, code, and data for real-world robot use.

  7. KinDER: A Physical Reasoning Benchmark for Robot Learning and Planning

    cs.RO 2026-04 unverdicted novelty 7.0

    KinDER is a new open-source benchmark that demonstrates substantial gaps in current robot learning and planning methods for handling physical constraints.

  8. PhysMem: Scaling Test-Time Memory for Embodied Physical Reasoning

    cs.RO 2026-02 unverdicted novelty 7.0

    PhysMem enables VLM-based robot planners to learn and verify physical properties through test-time interaction and hypothesis testing, raising success on a brick insertion task from 23% to 76%.

  9. RL-VLA³: A Flexible and Asynchronous Reinforcement Learning Framework for VLA Training

    cs.AI 2026-02 unverdicted novelty 7.0

    RL-VLA³ is an asynchronous RL framework for VLA training that delivers up to 85.2% higher throughput than synchronous baselines while preserving identical sample efficiency and scaling to 256 GPUs.

  10. D-VLA: A High-Concurrency Distributed Asynchronous Reinforcement Learning Framework for Vision-Language-Action Models

    cs.AI 2026-05 unverdicted novelty 6.0

    D-VLA uses plane decoupling and a swimlane pipeline to deliver higher throughput and linear speedup than prior RL frameworks when training billion- and trillion-parameter VLA models on benchmarks like LIBERO.

  11. D-VLA: A High-Concurrency Distributed Asynchronous Reinforcement Learning Framework for Vision-Language-Action Models

    cs.AI 2026-05 unverdicted novelty 6.0

    D-VLA introduces plane decoupling and a swimlane asynchronous pipeline to achieve high-concurrency RL training and linear scalability for billion- to trillion-parameter vision-language-action models.

  12. PriorVLA: Prior-Preserving Adaptation for Vision-Language-Action Models

    cs.RO 2026-05 unverdicted novelty 6.0

    PriorVLA preserves pretrained priors in VLA models through a frozen Prior Expert and trained Adaptation Expert, delivering better robot manipulation performance than full fine-tuning with only 25% of the parameter updates.

  13. Toward Visually Realistic Simulation: A Benchmark for Evaluating Robot Manipulation in Simulation

    cs.RO 2026-05 unverdicted novelty 6.0

    VISER is a new visually realistic simulation benchmark for robot manipulation tasks that uses PBR materials and MLLM-assisted asset generation, achieving 0.92 Pearson correlation with real-world policy performance.

  14. MolmoAct2: Action Reasoning Models for Real-world Deployment

    cs.RO 2026-05 unverdicted novelty 6.0

    MolmoAct2 is an open VLA model that outperforms baselines like Pi-05 on 7 benchmarks and whose backbone surpasses GPT-5 on 13 embodied-reasoning tasks through new datasets, specialized training, and architecture changes.

  15. Source-Modality Monitoring in Vision-Language Models

    cs.CL 2026-04 unverdicted novelty 6.0

    Vision-language models use semantic signals more than syntactic ones to bind words like 'image' to actual visual inputs, with implications for robustness in multimodal systems.

  16. SafetyALFRED: Evaluating Safety-Conscious Planning of Multimodal Large Language Models

    cs.AI 2026-04 unverdicted novelty 6.0

    SafetyALFRED shows multimodal LLMs recognize kitchen hazards accurately in QA tests but achieve low success rates when required to mitigate those hazards through embodied planning.

  17. If you're waiting for a sign... that might not be it! Mitigating Trust Boundary Confusion from Visual Injections on Vision-Language Agentic Systems

    cs.CV 2026-04 unverdicted novelty 6.0

    LVLM-based agents exhibit trust boundary confusion with visual injections and a multi-agent defense separating perception from decision-making reduces misleading responses while preserving correct ones.

  18. FASTER: Rethinking Real-Time Flow VLAs

    cs.RO 2026-03 conditional novelty 6.0

    FASTER uses a horizon-aware flow sampling schedule to compress immediate-action denoising to one step, slashing effective reaction latency in real-robot VLA deployments.

  19. CLAMP: Contrastive Learning for 3D Multi-View Action-Conditioned Robotic Manipulation Pretraining

    cs.RO 2026-01 unverdicted novelty 6.0

    CLAMP pretrains 3D multi-view encoders with contrastive learning on point clouds and actions, then initializes diffusion policies for more sample-efficient fine-tuning on robotic tasks.

  20. Sword: Style-Robust World Models as Simulators via Dynamic Latent Bootstrapping for VLA Policy Post-Training

    cs.CV 2026-05 unverdicted novelty 5.0

    Sword improves world model simulators for VLA policies by disentangling visual style from dynamics and bootstrapping latents for better consistency, outperforming baselines on LIBERO in generalization and RL post-training.

  21. Cooptimizing Safety and Performance Using Safety Value-Constrained Model Predictive Control

    cs.RO 2026-04 unverdicted novelty 5.0

    Augments MPC with a safety value function terminal constraint to achieve recursive feasibility and persistent safety while co-optimizing performance.

  22. RoboAgent: Chaining Basic Capabilities for Embodied Task Planning

    cs.RO 2026-04 unverdicted novelty 5.0

    RoboAgent chains basic vision-language capabilities inside a single VLM via a scheduler and trains it in three stages (behavior cloning, DAgger, RL) to improve embodied task planning.