hub Canonical reference

Gemini Robotics 1.5: Pushing the Frontier of Generalist Robots with Advanced Embodied Reasoning, Thinking, and Motion Transfer

Gemini Robotics Team, Abbas Abdolmaleki, Saminda Abeyruwan, Joshua Ainslie, Jean-Baptiste Alayrac, Montserrat Gonzalez Arenas · 2025 · cs.RO · arXiv 2510.03342

Canonical reference. 80% of citing Pith papers cite this work as background.

42 Pith papers citing it

Background 80% of classified citations

open full Pith review browse 42 citing papers arXiv PDF

abstract

General-purpose robots need a deep understanding of the physical world, advanced reasoning, and general and dexterous control. This report introduces the latest generation of the Gemini Robotics model family: Gemini Robotics 1.5, a multi-embodiment Vision-Language-Action (VLA) model, and Gemini Robotics-ER 1.5, a state-of-the-art Embodied Reasoning (ER) model. We are bringing together three major innovations. First, Gemini Robotics 1.5 features a novel architecture and a Motion Transfer (MT) mechanism, which enables it to learn from heterogeneous, multi-embodiment robot data and makes the VLA more general. Second, Gemini Robotics 1.5 interleaves actions with a multi-level internal reasoning process in natural language. This enables the robot to "think before acting" and notably improves its ability to decompose and execute complex, multi-step tasks, and also makes the robot's behavior more interpretable to the user. Third, Gemini Robotics-ER 1.5 establishes a new state-of-the-art for embodied reasoning, i.e., for reasoning capabilities that are critical for robots, such as visual and spatial understanding, task planning, and progress estimation. Together, this family of models takes us a step towards an era of physical agents-enabling robots to perceive, think and then act so they can solve complex multi-step tasks.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 8 baseline 1 method 1

citation-polarity summary

background 8 baseline 2

representative citing papers

RobotValues: Evaluating Household Robots When Human Values Conflict

cs.RO · 2026-06-02 · unverdicted · novelty 8.0

RobotValues is a benchmark of 10K value-conflict scenarios that reveals VLMs default to safety and accommodation while failing to follow instructions to prioritize other values 80% of the time.

Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding

cs.CV · 2026-01-15 · unverdicted · novelty 8.0

Molmo2 delivers state-of-the-art open-weight video VLMs with new grounding datasets and training methods that outperform prior open models and match or exceed some proprietary ones on pointing and tracking tasks.

Embodied.cpp: A Portable Inference Runtime of Embodied AI Models on Heterogeneous Robots

cs.RO · 2026-07-02 · unverdicted · novelty 7.0

Embodied.cpp introduces a portable C++ inference runtime with modular layers for deploying VLA and WAM models on heterogeneous robots, reporting 100% and 91% task success on two models plus memory reduction on a WAM benchmark.

FTP-1: A Generalist Foundation Tactile Policy Across Tactile Sensors for Contact-Rich Manipulation

cs.RO · 2026-06-11 · unverdicted · novelty 7.0

FTP-1 is the first foundation tactile policy pretrained on ~3000 hours of data from 26 sources across 21 sensors that improves performance on seen setups by 17.2% and transfers to unseen sensors with 31% success rate gain.

Beyond Attack Success Rate: Examining Trigger Leakage in Vision-Language Agentic Systems

cs.CR · 2026-06-10 · unverdicted · novelty 7.0

Introduces Neighbor Leakage Rate showing high trigger leakage in VLAS backdoors at 3% poisoning, caused by broad activation regions in fine-tuning that hard-negative samples can narrow.

Pause and Think: A Dataset and Benchmark for Video-Grounded Assistive Action Suggestion

cs.CV · 2026-05-30 · unverdicted · novelty 7.0 · 2 refs

Introduces pause-and-think-T dataset and pause-and-think-B benchmark; fine-tunes 4B VLM to 58% accuracy matching 235B model while generalizing out-of-distribution.

Spatially Prompted Visual Trajectory Prediction for Egocentric Manipulation

cs.CV · 2026-05-19 · unverdicted · novelty 7.0

The paper introduces SP-VTP as a new setting for egocentric manipulation, releases the EgoSPT dataset with first-frame spatial annotations, and proposes the SPOT model that outperforms non-prompted baselines on cross-scene trajectory prediction.

Offline Policy Evaluation for Manipulation Policies via Discounted Liveness Formulation

cs.RO · 2026-05-12 · conditional · novelty 7.0

A liveness-based Bellman operator enables conservative offline policy evaluation for manipulation tasks by encoding task progression and reducing truncation bias from finite horizons.

MemCompiler: Compile, Don't Inject -- State-Conditioned Memory for Embodied Agents

cs.RO · 2026-05-08 · unverdicted · novelty 7.0 · 2 refs

MemCompiler reframes memory use as state-conditioned compilation, delivering relevant guidance via text and latent channels to improve embodied agent performance up to 129% and cut latency 60% versus static injection.

KinDER: A Physical Reasoning Benchmark for Robot Learning and Planning

cs.RO · 2026-04-28 · unverdicted · novelty 7.0

KinDER is a new open-source benchmark that demonstrates substantial gaps in current robot learning and planning methods for handling physical constraints.

PhysMem: Scaling Test-Time Memory for Embodied Physical Reasoning

cs.RO · 2026-02-23 · unverdicted · novelty 7.0

PhysMem enables VLM-based robot planners to learn and verify physical properties through test-time interaction and hypothesis testing, raising success on a brick insertion task from 23% to 76%.

RL-VLA$^3$: A Flexible and Asynchronous Reinforcement Learning Framework for VLA Training

cs.AI · 2026-02-05 · unverdicted · novelty 7.0 · 2 refs

RL-VLA³ is an asynchronous RL framework for VLA training that delivers up to 85.2% higher throughput than synchronous baselines while preserving identical sample efficiency and scaling to 256 GPUs.

Sequential Planning via Anchored Robotic Keypoints

cs.RO · 2026-06-29 · unverdicted · novelty 6.0

SPARK reaches 43.7% success on six LIBERO-PRO cells by LLM-generated typed behavior trees plus multi-prompt perception and recovery, more than doubling CaP-Agent0 and VLA baselines.

Training Vision-Language-Action Models with Dense Embodied Chain-of-Thought Supervision

cs.RO · 2026-06-29 · unverdicted · novelty 6.0 · 2 refs

ZR-0 is a dual-stream VLA model trained with dense ECoT supervision on 60M frames from 400K trajectories to enable cross-embodiment transfer in simulation and real-world settings.

Vesta: A Generalist Embodied Reasoning Model

cs.RO · 2026-06-18 · unverdicted · novelty 6.0

Vesta is a unified embodied generalist model that outperforms specialist baselines by over 20% on average and improves real-world robotic task success by over 35%.

Finetuning Vision-Language-Action Models Requires Fewer Layers Than You Think

cs.RO · 2026-06-18 · unverdicted · novelty 6.0

VLA models exhibit layer-wise redundancy allowing up to 50% depth compression via training-free CKA-based removal, yielding faster fine-tuning and inference with no performance loss on robot tasks.

Playful Agentic Robot Learning

cs.RO · 2026-06-17 · unverdicted · novelty 6.0

RATs agents generate and solve their own exploratory tasks during play, distill successful code into a skill library, and reuse it to improve held-out task performance by 20.6 and 17.0 points on two benchmarks.

Geometric Action Model for Robot Policy Learning

cs.RO · 2026-06-15 · unverdicted · novelty 6.0

GAM splits a geometric foundation model to enable language-conditioned future geometry prediction and action decoding for robot policies, claiming superior performance on manipulation benchmarks.

DIRECT: When and Where Should You Allocate Test-Time Compute in Embodied Planners?

cs.RO · 2026-06-10 · unverdicted · novelty 6.0

DIRECT is a multimodal-context router that allocates test-time compute across chain-of-thought depth, model size, and memory history for VLM embodied planners, improving the success-cost Pareto frontier and matching stronger models at up to 65% lower latency on benchmarks and a physical Franka arm.

What Matters in Orchestrating Robot Policies: A Systematic Study of Hierarchical VLA Agents

cs.RO · 2026-06-09 · unverdicted · novelty 6.0

A systematic study of hierarchical VLA agents identifies design principles that improve robot manipulation performance over flat and naive hierarchical baselines in simulation and real-world experiments.

Revisiting Embodied Chain-of-Thought for Generalizable Robot Manipulation

cs.RO · 2026-06-02 · unverdicted · novelty 6.0

ERVLA trains on a 978k-trajectory embodied CoT corpus using reasoning as supervision with dropout, then predicts actions without CoT at test time, reaching 86.9% on LIBERO-Plus and 53.2% on VLABench.

Safe and Steerable Geometric Motion Policies for Robotic Dexterous Manipulation

cs.RO · 2026-05-20 · unverdicted · novelty 6.0

SafePBDS uses pullback control barrier functions and a task manifold action interface to generate certifiably safe, steerable motions on high-DOF robots from objectives defined on arbitrary geometric spaces.

Do Vision-Language Models Understand 3D Scenes or Just Catalogue Objects?

cs.CV · 2026-05-19 · unverdicted · novelty 6.0 · 2 refs

VLMs achieve 53-97% on rearrangement planning but only 6-45% on occlusion and under 7% on reflections, with failures localized to visual token compression after the vision encoder.

D-VLA: A High-Concurrency Distributed Asynchronous Reinforcement Learning Framework for Vision-Language-Action Models

cs.AI · 2026-05-13 · unverdicted · novelty 6.0 · 2 refs

D-VLA uses plane decoupling and a swimlane pipeline to deliver higher throughput and linear speedup than prior RL frameworks when training billion- and trillion-parameter VLA models on benchmarks like LIBERO.

citing papers explorer

Showing 1 of 1 citing paper after filters.

Beyond Attack Success Rate: Examining Trigger Leakage in Vision-Language Agentic Systems cs.CR · 2026-06-10 · unverdicted · none · ref 17 · internal anchor
Introduces Neighbor Leakage Rate showing high trigger leakage in VLAS backdoors at 3% poisoning, caused by broad activation regions in fine-tuning that hard-negative samples can narrow.

Gemini Robotics 1.5: Pushing the Frontier of Generalist Robots with Advanced Embodied Reasoning, Thinking, and Motion Transfer

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer