arxiv: 2503.20020 · v1 · submitted 2025-03-25 · 💻 cs.RO


Gemini Robotics: Bringing AI into the Physical World

Acorn Pooley, Adil Dostmohamed, Alexander Herzog, Alex X. Lee, Allan Zhou, Anirudha Majumdar, Annie Xie, Anthony Brohan, Antoine Laurens, Arunkumar Byravan, Ashwin Balakrishna, Assaf Hurwitz Michaely, Atil Iscen, Ayzaan Wahid, Brandon Hernaez, Carolina Parada, Charles Shu, Chase Kew, Chuyuan Fu, Claudio Fantacci, Cody Fong, Coline Devin, Danny Driess, David D'Ambrosio, Debidatta Dwibedi, Deepali Jain, Dhruv Shah, Dmitry Kalashnikov, Dorsa Sadigh, Emilio Parisotto, Erik Frey, Federico Casarini, Fei Xia, Francesco Nori, Gemini Robotics Team, Giulia Vezzani, Grace Vesom, Hao-Tien Lewis Chiang, Isabel Leal, Jacky Liang, Jake Varley, Jan Humplik, Jean-Baptiste Alayrac, Jerad Kirkland, Jie Tan, Jingwei Zhang, Jinyu Xie, Jonathan Tompson, Jose Enrique Chen, Joshua Ainslie, Jost Tobias Springenberg, Kanishka Rao, Kathryn Shea, Keerthana Gopalakrishnan, Ken Caluwaerts, Konstantinos Bousmalis, Krista Reymann, Krzysztof Choromanski, Laura Graesser, Leonard Hasenclever, Maria Bauza, Marissa Giustina, M. Emre Karagozler, Michael Elabd, Michael Neunert, Michiel Blokzijl, Mithun George Jacob, Mohit Sharma, Montserrat Gonzalez Arenas, Nicolas Heess, Norman Di Palo, Oriol Vinyals, Oscar Chang, Pannag Sanketi, Paul Wohlhart, Peng Xu, Peter Pastor, Pierre Sermanet, Rachel Sterneck, Radu Soricut, R. Alex Hofer, Razvan Surdulescu, Robert Baruch, Robert Moreno, Rui Yao, Ryan Julian, Saminda Abeyruwan, Sean Kirmani, Sergey Yaroshenko, Serkan Cabi, Sharath Maddineni, Sichun Xu, Stefani Karp, Stefano Saliceti, Stefan Welker, Steven Bohez, Sudeep Dasari, Sumeet Singh, Ted Xiao, Thomas Buschmann, Thomas Lampe, Tianli Ding, Tingnan Zhang, Todor Davchev, Travis Armstrong, Tsang-Wei Edward Lee, Vikas Sindhwani, Vincent Vanhoucke, Wenhao Yu, Wentao Yuan, Xi Chen, Yilun Du, Ying Xu, Yixin Lin, Yuheng Kuang, Yuxiang Yang, Yuxiang Zhou, Zhuo Xu


Pith reviewed 2026-05-11 15:24 UTC · model grok-4.3

classification 💻 cs.RO

keywords robotics · vision-language-action · multimodal models · embodied reasoning · robot manipulation · generalist AI · fine-tuning · safety


The pith

Gemini Robotics is a Vision-Language-Action model that directly controls robots to perform complex manipulation tasks in varied and unseen environments.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Gemini Robotics, a family of models built on Gemini 2.0 for physical agents. It claims the core VLA model executes smooth, reactive movements across many manipulation tasks, stays robust to object and position changes, works in new environments, and follows open-vocabulary instructions. Fine-tuning then lets it handle long-horizon dexterous work, learn short tasks from about 100 examples, and switch to new robot bodies. This rests on a companion embodied reasoning model that adds spatial and temporal understanding for tasks like object detection and trajectory prediction. The result aims to move general AI into real-world robot control.

Core claim

Gemini Robotics, an advanced Vision-Language-Action generalist model built upon Gemini 2.0, executes smooth and reactive movements to tackle a wide range of complex manipulation tasks while also being robust to variations in object types and positions, handling unseen environments as well as following diverse, open vocabulary instructions. With additional fine-tuning, it can be specialized to new capabilities including solving long-horizon, highly dexterous tasks, learning new short-horizon tasks from as few as 100 demonstrations and adapting to completely novel robot embodiments. This is made possible because it builds on the Gemini Robotics-ER model, which extends Gemini's multimodal reasoning capabilities into the physical world.

What carries the argument

The Vision-Language-Action (VLA) generalist model Gemini Robotics, which takes visual input and language instructions to output direct robot actions, supported by the Gemini Robotics-ER embodied reasoning model that adds spatial and temporal understanding for robotics tasks.
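
As a rough illustration of the interface described above (visual input and a language instruction in, a robot action out), the following Python sketch models a VLA policy as a per-step function. Every name here (`Observation`, `Action`, `VLAPolicy`, `ConstantPolicy`, `run_episode`) and the joint-targets-plus-gripper action layout are hypothetical choices made for illustration; the summary above does not specify the model's actual action space or API.

```python
from dataclasses import dataclass
from typing import Protocol, Sequence


@dataclass(frozen=True)
class Observation:
    """One control step's input: a camera frame plus the task instruction."""
    image: Sequence[Sequence[float]]  # placeholder for pixels (rows of values)
    instruction: str                  # open-vocabulary text instruction


@dataclass(frozen=True)
class Action:
    """Hypothetical low-level command: joint targets plus gripper opening."""
    joint_targets: tuple[float, ...]
    gripper: float  # assumed convention: 0.0 = closed, 1.0 = open


class VLAPolicy(Protocol):
    """Anything that maps an observation to an action counts as a policy here."""
    def act(self, obs: Observation) -> Action: ...


class ConstantPolicy:
    """Stand-in policy that ignores its input; a real VLA would run a model here."""

    def __init__(self, n_joints: int = 6) -> None:
        self.n_joints = n_joints

    def act(self, obs: Observation) -> Action:
        return Action(joint_targets=(0.0,) * self.n_joints, gripper=1.0)


def run_episode(policy: VLAPolicy, frames, instruction: str) -> list[Action]:
    """Closed-loop control: query the policy once per incoming camera frame."""
    return [policy.act(Observation(image=f, instruction=instruction)) for f in frames]
```

Querying the policy on every frame, rather than planning once up front, is what would let such a controller react to changes in the scene; whether the actual system is structured this way is an assumption of the sketch.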

Load-bearing premise

That the model's robustness to variations and ability to generalize from limited demonstrations will hold when running on physical robots in uncontrolled real-world conditions.

What would settle it

Running the model on a physical robot in a new room with unfamiliar objects and instructions, then measuring whether task success rates match the reported levels.
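
One way to make "match the reported levels" concrete is to put a confidence interval around the success rate observed in the replication and check whether the reported rate falls inside it. The sketch below uses the standard Wilson score interval for a binomial proportion; the function names and the choice of a 95% level are assumptions for illustration, not something the paper or this review prescribes.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial success rate (z=1.96 is roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p_hat = successes / trials
    denom = 1.0 + z * z / trials
    center = (p_hat + z * z / (2 * trials)) / denom
    half = (z / denom) * math.sqrt(
        p_hat * (1 - p_hat) / trials + z * z / (4 * trials * trials)
    )
    return max(0.0, center - half), min(1.0, center + half)


def consistent_with_reported(successes: int, trials: int, reported_rate: float) -> bool:
    """True if the reported rate lies inside the interval around the observed rate."""
    low, high = wilson_interval(successes, trials)
    return low <= reported_rate <= high
```

For example, `consistent_with_reported(14, 20, 0.8)` asks whether 14 successes in 20 hypothetical trials are compatible with a reported 80% success rate. With so few trials the interval is wide, so a replication of this kind needs many episodes before it can contradict a reported figure.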

Original abstract

Recent advancements in large multimodal models have led to the emergence of remarkable generalist capabilities in digital domains, yet their translation to physical agents such as robots remains a significant challenge. This report introduces a new family of AI models purposefully designed for robotics and built upon the foundation of Gemini 2.0. We present Gemini Robotics, an advanced Vision-Language-Action (VLA) generalist model capable of directly controlling robots. Gemini Robotics executes smooth and reactive movements to tackle a wide range of complex manipulation tasks while also being robust to variations in object types and positions, handling unseen environments as well as following diverse, open vocabulary instructions. We show that with additional fine-tuning, Gemini Robotics can be specialized to new capabilities including solving long-horizon, highly dexterous tasks, learning new short-horizon tasks from as few as 100 demonstrations and adapting to completely novel robot embodiments. This is made possible because Gemini Robotics builds on top of the Gemini Robotics-ER model, the second model we introduce in this work. Gemini Robotics-ER (Embodied Reasoning) extends Gemini's multimodal reasoning capabilities into the physical world, with enhanced spatial and temporal understanding. This enables capabilities relevant to robotics including object detection, pointing, trajectory and grasp prediction, as well as multi-view correspondence and 3D bounding box predictions. We show how this novel combination can support a variety of robotics applications. We also discuss and address important safety considerations related to this new class of robotics foundation models. The Gemini Robotics family marks a substantial step towards developing general-purpose robots that realizes AI's potential in the physical world.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

This is a Google tech report announcing Gemini Robotics VLA and ER models with big claims on generalization and few-shot adaptation, but almost no data or benchmarks to evaluate them.


The key takeaway is that this paper announces two new models from the Gemini team: Gemini Robotics, a vision-language-action generalist built on Gemini 2.0 for direct robot control, and Gemini Robotics-ER, which adds embodied reasoning like grasp prediction, trajectory forecasting, and 3D understanding. They claim the system handles complex manipulation, open-vocab instructions, unseen environments, and can pick up new short-horizon tasks from roughly 100 demos or adapt to novel robot bodies after fine-tuning. The ER component is positioned as the foundation that makes the rest possible by extending Gemini's multimodal reasoning into spatial and temporal physical tasks. They also cover safety considerations for these foundation models. That combination of scale and stated adaptability is the main new element here compared to earlier VLA work.

The write-up does a reasonable job sketching the intended use cases and why the embodied reasoning layer matters for robotics applications. It stays high-level and readable, which fits an industry report meant to signal direction.

The soft spots are straightforward. The text provides no quantitative results, no benchmark tables, no comparisons to prior models like RT-2 or OpenVLA, and no details on training data scale or failure modes. Claims about smooth reactive control and robustness to object variation are stated without the experiments or ablations that would let a reader check them. This makes the soundness hard to assess from the document itself; it functions more as a capabilities overview than a verifiable research contribution. If the full version includes internal evals or real-robot videos with metrics, that would strengthen it considerably.

This paper is mainly for robotics and multimodal AI researchers who want to track what a major lab is shipping and how they are framing the next steps toward generalist physical agents. It is not a self-contained study with reproducible findings, so it would not be the first thing I assign to students. A serious editor should still send it to peer review rather than desk-reject, because the topic is timely and the underlying models are likely substantial; referees would just need to press hard for the missing evidence and comparisons.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces Gemini Robotics, a family of models built on Gemini 2.0 for robotics applications. It presents Gemini Robotics as a Vision-Language-Action (VLA) generalist model for direct robot control, claiming smooth reactive movements on complex manipulation tasks, robustness to object variations and positions, generalization to unseen environments, and adherence to open-vocabulary instructions. A second model, Gemini Robotics-ER, extends multimodal reasoning with spatial/temporal capabilities for tasks such as object detection, trajectory prediction, grasp prediction, and 3D bounding box estimation. The work further describes fine-tuning for long-horizon dexterous tasks, few-shot adaptation (as few as 100 demonstrations), embodiment transfer, and safety considerations.

Significance. If the performance and generalization claims hold under rigorous evaluation, the work would represent a meaningful advance in embodied AI by scaling large multimodal models to physical control. The explicit coupling of an embodied-reasoning module (ER) with action generation is a coherent architectural choice that could support more reliable long-horizon and open-vocabulary robot behavior. The emphasis on safety considerations for foundation-model robotics is also timely.

major comments (2)

Abstract: The central claims of 'smooth and reactive movements,' 'robust[ness] to variations in object types and positions,' 'handling unseen environments,' and 'learning new short-horizon tasks from as few as 100 demonstrations' are presented without any quantitative metrics, success rates, benchmark comparisons (e.g., against RT-X, OpenVLA, or RT-2), or ablation results. These assertions are load-bearing for the paper's contribution yet remain unsupported in the provided text.
Model and Capabilities sections: The description of Gemini Robotics-ER states that it enables 'object detection, pointing, trajectory and grasp prediction, as well as multi-view correspondence and 3D bounding box predictions,' but supplies no architectural details, training objectives, data sources, or evaluation protocols for these embodied-reasoning tasks. Without such information, it is impossible to assess whether the claimed spatial/temporal enhancements are realized or how they integrate with the VLA policy.

minor comments (2)

The relationship between the base Gemini 2.0 model, Gemini Robotics-ER, and the final VLA policy should be illustrated with a diagram or explicit data-flow description to clarify which components are frozen versus fine-tuned.
Safety considerations are mentioned but lack concrete examples of failure modes tested or mitigation strategies deployed during real-robot deployment.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed review. The comments highlight important opportunities to strengthen the presentation of quantitative evidence and model specifics. We address each point below and have revised the manuscript to incorporate the suggested improvements.


Referee: Abstract: The central claims of 'smooth and reactive movements,' 'robust[ness] to variations in object types and positions,' 'handling unseen environments,' and 'learning new short-horizon tasks from as few as 100 demonstrations' are presented without any quantitative metrics, success rates, benchmark comparisons (e.g., against RT-X, OpenVLA, or RT-2), or ablation results. These assertions are load-bearing for the paper's contribution yet remain unsupported in the provided text.

Authors: We agree that the abstract would be strengthened by including key quantitative results. The full manuscript reports detailed evaluations, including success rates on complex manipulation tasks, robustness metrics across object variations, generalization performance in unseen environments, and few-shot adaptation results with as few as 100 demonstrations. Direct comparisons to baselines such as RT-X, OpenVLA, and RT-2, along with ablations, appear in the Experiments section. We have revised the abstract to highlight representative metrics (e.g., task success rates and adaptation efficiency) while retaining its concise nature and directing readers to the full results.

revision: yes

Referee: Model and Capabilities sections: The description of Gemini Robotics-ER states that it enables 'object detection, pointing, trajectory and grasp prediction, as well as multi-view correspondence and 3D bounding box predictions,' but supplies no architectural details, training objectives, data sources, or evaluation protocols for these embodied-reasoning tasks. Without such information, it is impossible to assess whether the claimed spatial/temporal enhancements are realized or how they integrate with the VLA policy.

Authors: The manuscript contains a dedicated subsection on Gemini Robotics-ER that outlines the architectural extensions to Gemini 2.0 for spatial and temporal reasoning, the integration with the VLA policy for action generation, and the supported capabilities. Training uses a combination of supervised objectives on curated embodied datasets for tasks including trajectory and grasp prediction, with evaluations on standard protocols for object detection, 3D bounding boxes, and multi-view correspondence. To improve accessibility, we have expanded this section with additional specifics on the model architecture (e.g., module designs for spatial grounding), exact training objectives, primary data sources, and evaluation benchmarks, clarifying the coupling between the ER module and downstream control.

revision: yes

Circularity Check

0 steps flagged

No significant circularity; purely descriptive model introduction

full rationale

The paper is a technical report describing the Gemini Robotics family of models and their capabilities. It contains no equations, derivations, predictions, or mathematical claims that could reduce to inputs by construction. All statements are high-level descriptions of architecture, training, and observed behaviors, with no self-referential logic or fitted parameters presented as independent results. Self-citations (e.g., to Gemini 2.0) are standard background references and not load-bearing for any deductive chain. The work stands on its own as an engineering report evaluated against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The claims rely heavily on the untested assumption that the described fine-tuning and generalization properties will manifest in physical robot systems.

axioms (1)

domain assumption: Gemini 2.0's multimodal capabilities can be extended to embodied reasoning and robot action generation.
This underpins the entire construction of the new models as described.



Forward citations

Cited by 45 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

RIO: Flexible Real-Time Robot I/O for Cross-Embodiment Robot Learning
cs.RO 2026-05 unverdicted novelty 7.0

RIO introduces a lightweight open-source framework that abstracts real-time robot I/O to support easy switching between embodiments and platforms for collecting data and deploying VLAs.
Towards Backdoor-Based Ownership Verification for Vision-Language-Action Models
cs.RO 2026-05 unverdicted novelty 7.0

GuardVLA embeds a stealthy backdoor watermark in VLAs via secret messages in visual data and uses a swap-and-detect mechanism for post-release ownership verification that preserves task performance.
NoiseGate: Learning Per-Latent Timestep Schedules as Information Gating in World Action Models
cs.RO 2026-05 unverdicted novelty 7.0

NoiseGate learns per-latent timestep schedules as an information-gating policy in diffusion-based world action models, yielding consistent gains on RoboTwin manipulation tasks.
VLA-GSE: Boosting Parameter-Efficient Fine-Tuning in VLA with Generalized and Specialized Experts
cs.RO 2026-05 unverdicted novelty 7.0

VLA-GSE improves VLA adaptation by initializing generalized shared experts and specialized routed experts via spectral decomposition of the backbone, outperforming full fine-tuning and other PEFT methods on robotic be...
Being-H0.7: A Latent World-Action Model from Egocentric Videos
cs.RO 2026-04 unverdicted novelty 7.0

Being-H0.7 adds future-aware latent reasoning to direct VLA policies via dual-branch alignment on latent queries, matching world-model benefits at VLA efficiency.
${\pi}_{0.7}$: a Steerable Generalist Robotic Foundation Model with Emergent Capabilities
cs.LG 2026-04 unverdicted novelty 7.0

π₀.₇ is a steerable generalist robotic model that uses rich multimodal prompts including language, subgoal images, and performance metadata to achieve out-of-the-box generalization across tasks and robot bodies.
[Emerging Ideas] Artificial Tripartite Intelligence: A Bio-Inspired, Sensor-First Architecture for Physical AI
cs.AI 2026-04 unverdicted novelty 7.0

ATI is a tripartite bio-inspired architecture for physical AI that co-designs sensing and inference, shown in a camera prototype to raise accuracy from 53.8% to 88% and cut remote invocations by 43.3%.
VP-VLA: Visual Prompting as an Interface for Vision-Language-Action Models
cs.RO 2026-03 unverdicted novelty 7.0

VP-VLA decouples high-level reasoning from low-level control in VLA models by rendering spatial anchors as visual prompts directly in the RGB observation space, outperforming end-to-end baselines.
GuidedVLA: Specifying Task-Relevant Factors via Plug-and-Play Action Attention Specialization
cs.RO 2026-05 unverdicted novelty 6.0

GuidedVLA improves VLA success rates by manually supervising separate attention heads in the action decoder with auxiliary signals for task-relevant factors.
Beyond Thinking: Imagining in 360$^\circ$ for Humanoid Visual Search
cs.CV 2026-05 unverdicted novelty 6.0

Imagining in 360° decouples visual search into a single-step probabilistic semantic layout predictor and an actor, removing the need for multi-turn CoT reasoning and trajectory annotations while improving efficiency i...
One Token Per Frame: Reconsidering Visual Bandwidth in World Models for VLA Policy
cs.CV 2026-05 unverdicted novelty 6.0

Reducing visual input to one token per frame in world models for vision-language-action policies maintains long-horizon performance while improving success rates on MetaWorld, LIBERO, and real-robot tasks.
One Token Per Frame: Reconsidering Visual Bandwidth in World Models for VLA Policy
cs.CV 2026-05 unverdicted novelty 6.0

Reducing visual input to one token per frame via adaptive attention pooling and a unified flow-matching objective improves long-horizon performance in VLA policies on MetaWorld, LIBERO, and real-robot tasks.
Escaping the Diversity Trap in Robotic Manipulation via Anchor-Centric Adaptation
cs.RO 2026-05 unverdicted novelty 6.0

Anchor-Centric Adaptation escapes the diversity trap by prioritizing repeated demonstrations at core anchors over broad coverage, yielding higher success rates under fixed data budgets in robotic manipulation.
Odysseus: Scaling VLMs to 100+ Turn Decision-Making in Games via Reinforcement Learning
cs.LG 2026-05 unverdicted novelty 6.0

Odysseus adapts PPO with a turn-level critic and leverages pretrained VLM action priors to train agents achieving at least 3x average game progress over frontier models in long-horizon Super Mario Land.
Breaking Lock-In: Preserving Steerability under Low-Data VLA Post-Training
cs.RO 2026-04 unverdicted novelty 6.0

DeLock mitigates lock-in in low-data VLA post-training via visual grounding preservation and test-time contrastive prompt guidance, outperforming baselines across eight evaluations while matching data-heavy generalist...
RL Token: Bootstrapping Online RL with Vision-Language-Action Models
cs.LG 2026-04 unverdicted novelty 6.0

RL Token enables sample-efficient online RL fine-tuning of large VLAs, delivering up to 3x speed gains and higher success rates on real-robot manipulation tasks within minutes to hours.
GazeVLA: Learning Human Intention for Robotic Manipulation
cs.RO 2026-04 unverdicted novelty 6.0

GazeVLA pretrains on large human egocentric datasets to capture gaze-based intention, then finetunes on limited robot data with chain-of-thought reasoning to achieve better robotic manipulation performance than baselines.
Learning-augmented robotic automation for real-world manufacturing
cs.RO 2026-04 conditional novelty 6.0

A learning-augmented robotic system automated deformable cable insertion and soldering on a live electric-motor production line for 5 hours 10 minutes, producing 108 motors at 99.4% pass rate with under 20 minutes of ...
dWorldEval: Scalable Robotic Policy Evaluation via Discrete Diffusion World Model
cs.RO 2026-04 unverdicted novelty 6.0

A discrete diffusion model tokenizes multimodal robotic data and uses a progress token to predict future states and task completion for scalable policy evaluation.
SafetyALFRED: Evaluating Safety-Conscious Planning of Multimodal Large Language Models
cs.AI 2026-04 unverdicted novelty 6.0

SafetyALFRED shows multimodal LLMs recognize kitchen hazards accurately in QA tests but achieve low success rates when required to mitigate those hazards through embodied planning.
Chain Of Interaction Benchmark (COIN): When Reasoning meets Embodied Interaction
cs.RO 2026-04 unverdicted novelty 6.0

COIN provides 50 interactive robotic tasks, a 1000-demonstration dataset collected via AR teleoperation, and metrics showing that CodeAsPolicy, VLA, and H-VLA models fail at causally-dependent interactive reasoning du...
Long-Term Memory for VLA-based Agents in Open-World Task Execution
cs.RO 2026-04 unverdicted novelty 6.0

ChemBot adds dual-layer memory and future-state asynchronous inference to VLA models, enabling better long-horizon success in chemical lab automation on collaborative robots.
E-VLA: Event-Augmented Vision-Language-Action Model for Dark and Blurred Scenes
cs.CV 2026-04 conditional novelty 6.0

E-VLA integrates event streams directly into VLA models via lightweight fusion, raising Pick-Place success from 0% to 60-90% at 20 lux and from 0% to 20-25% under severe motion blur.
FlashSAC: Fast and Stable Off-Policy Reinforcement Learning for High-Dimensional Robot Control
cs.LG 2026-04 unverdicted novelty 6.0

FlashSAC scales up Soft Actor-Critic with fewer updates, larger models, higher data throughput, and norm bounds to deliver faster, more stable training than PPO on high-dimensional robot control tasks across dozens of...
Fast-WAM: Do World Action Models Need Test-time Future Imagination?
cs.CV 2026-03 unverdicted novelty 6.0

Fast-WAM shows that explicit future imagination at test time is not required for strong WAM performance; video modeling during training provides the main benefit.
World Action Models are Zero-shot Policies
cs.RO 2026-02 unverdicted novelty 6.0

DreamZero uses a 14B video diffusion model as a World Action Model to achieve over 2x better zero-shot generalization on real robots than state-of-the-art VLAs, real-time 7Hz closed-loop control, and cross-embodiment ...
InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency
cs.CV 2025-08 unverdicted novelty 6.0

InternVL3.5 advances open-source multimodal models with Cascade RL for +16% reasoning gains and ViR for 4x inference speedup, with the 241B model reaching SOTA among open-source MLLMs on multimodal, reasoning, and age...
GLM-4.5V and GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning
cs.CV 2025-07 unverdicted novelty 6.0

GLM-4.5V reaches state-of-the-art results on 42 multimodal benchmarks among open-source models of similar size by applying reinforcement learning with curriculum sampling to a strong vision foundation model.
DexVLA: Vision-Language Model with Plug-In Diffusion Expert for General Robot Control
cs.RO 2025-02 unverdicted novelty 6.0

DexVLA combines a scaled diffusion action expert with embodiment curriculum learning to achieve better generalization and performance than prior VLA models on diverse robot hardware and long-horizon tasks.
Nautilus: From One Prompt to Plug-and-Play Robot Learning
cs.RO 2026-05 unverdicted novelty 5.0

NAUTILUS is a prompt-driven harness that automates plug-and-play adapters, typed contracts, and validation for policies, benchmarks, and robots in learning research.
Behavioral Mode Discovery for Fine-tuning Multimodal Generative Policies
cs.LG 2026-05 unverdicted novelty 5.0

Unsupervised behavioral mode discovery combined with mutual information rewards enables RL fine-tuning of multimodal generative policies that achieves higher success rates without losing action diversity.
VLA-GSE: Boosting Parameter-Efficient Fine-Tuning in VLA with Generalized and Specialized Experts
cs.RO 2026-05 unverdicted novelty 5.0

VLA-GSE uses spectral decomposition of the VLA backbone to create generalized and specialized experts, enabling effective robot task adaptation while updating only 2.51% of parameters and achieving 81.2% zero-shot suc...
From Where Things Are to What They Are For: Benchmarking Spatial-Functional Intelligence in Multimodal LLMs
cs.CV 2026-05 unverdicted novelty 5.0

SFI-Bench shows current multimodal LLMs struggle to integrate spatial memory with functional reasoning and external knowledge in video tasks.
Context Unrolling in Omni Models
cs.CV 2026-04 unverdicted novelty 5.0

Omni is a multimodal model whose native training on diverse data types enables context unrolling, allowing explicit reasoning across modalities to better approximate shared knowledge and improve downstream performance.
ALAS: Adaptive Long-Horizon Action Synthesis via Async-pathway Stream Disentanglement
cs.RO 2026-04 unverdicted novelty 5.0

ALAS disentangles environment and self-state streams via bio-inspired modules to deliver 23% higher subtask success and 29% better execution efficiency on long-horizon HSI tasks.
Qwen3.5-Omni Technical Report
cs.CL 2026-04 unverdicted novelty 5.0

Qwen3.5-Omni scales an omnimodal model to hundreds of billions of parameters with 256k context, introduces ARIA for stable speech synthesis, and reports SOTA performance on 215 audio-visual benchmarks while adding mul...
HiVLA: A Visual-Grounded-Centric Hierarchical Embodied Manipulation System
cs.CV 2026-04 unverdicted novelty 5.0

HiVLA decouples VLM-based semantic planning from DiT-based motor control via structured plans and cascaded cross-attention to outperform end-to-end VLA baselines in long-horizon and fine-grained manipulation.
HiVLA: A Visual-Grounded-Centric Hierarchical Embodied Manipulation System
cs.CV 2026-04 unverdicted novelty 5.0

HiVLA decouples VLM-based semantic planning with visual grounding from a cascaded cross-attention DiT action expert, outperforming end-to-end VLAs on long-horizon and fine-grained manipulation.
DA-PTQ: Drift-Aware Post-Training Quantization for Efficient Vision-Language-Action Models
cs.RO 2026-04 unverdicted novelty 5.0

DA-PTQ quantizes VLAs by compensating cross-space distortions and allocating mixed precision to minimize motion errors and kinematic drift in trajectories.
The Cartesian Cut in Agentic AI
cs.AI 2026-04 unverdicted novelty 5.0

LLM agents use a Cartesian split between learned prediction and engineered control, enabling modularity but creating sensitivity and bottlenecks unlike integrated biological systems.
OpenSpatial: A Principled Data Engine for Empowering Spatial Intelligence
cs.CL 2026-04 unverdicted novelty 5.0

OpenSpatial supplies a principled open-source data engine and 3-million-sample dataset that raises spatial-reasoning model performance by an average of 19 percent on benchmarks.
RLDX-1 Technical Report
cs.RO 2026-05 unverdicted novelty 4.0

RLDX-1 outperforms frontier VLAs such as π0.5 and GR00T N1.6 on dexterous manipulation benchmarks, reaching 86.8% success on ALLEX humanoid tasks versus around 40% for the baselines.
RLDX-1 Technical Report
cs.RO 2026-05 unverdicted novelty 4.0

RLDX-1 achieves 86.8% success on complex ALLEX humanoid manipulation tasks where prior VLAs reach only around 40%.
ARIS: Agentic and Relationship Intelligence System for Social Robots
cs.RO 2026-05 unverdicted novelty 4.0

ARIS integrates a graph-based Social World Model, RAG, and agentic architecture for social robots and reports higher user ratings for intelligence, animacy, anthropomorphism, and likeability than an LLM baseline in a ...
Seed1.5-VL Technical Report
cs.CV 2025-05 unverdicted novelty 4.0

Seed1.5-VL is a compact multimodal model that sets new records on dozens of vision-language benchmarks and outperforms prior systems on agent-style tasks.

Reference graph

Works this paper leans on

34 extracted references · 34 canonical work pages · cited by 41 Pith papers · 1 internal anchor

[1]

URL https://openreview.net/forum?id=T8AiZj1QdN. Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, Carol Chen, Catherine Olsson, Christopher Olah, Danny Hernandez, Dawn Drain, Deep Ganguli, Dustin Li, Eli Tran-Johnson, Ethan Perez, Jamie Kerr, Jared Mueller,...

work page internal anchor Pith review Pith/arXiv arXiv 2022
[2]

Jeff Dean

URL https://proceedings.mlr.press/v270/xu25b.html. Jeff Dean. Introducing Pathways: A next-generation AI architecture, 2021. URLhttps://blog. google/technology/ai/introducing-pathways-next-generation-ai-architecture/. MattDeitke,ChristopherClark,SanghoLee,RohunTripathi,YueYang,JaeSungPark,Mohammadreza Salehi, Niklas Muennighoff, Kyle Lo, Luca Soldaini, et...

work page doi:10.1109/lra.2024.3410155 2021
[3]

2021 , url =

URL https://arxiv.org/abs/2405.02292. UMI-Data. UMI-Data, 2024. URLhttps://umi-data.github.io/. Jake Varley, Sumeet Singh, Deepali Jain, Krzysztof Choromanski, Andy Zeng, Somnath Basu Roy Chowdhury, Avinava Dubey, and Vikas Sindhwani. Embodied AI with two arms: Zero-shot learning, safety and modularity. InIROS, pages 3651–3657. IEEE, 2024. ISBN 979-8-3503...

work page doi:10.1109/cvpr46437.2021 2024
[4]

in_frame

Contributions and Acknowledgments Authors Saminda Abeyruwan Joshua Ainslie Jean-Baptiste Alayrac Montserrat Gonzalez Arenas Travis Armstrong Ashwin Balakrishna Robert Baruch Maria Bauza Michiel Blokzijl Steven Bohez Konstantinos Bousmalis Anthony Brohan Thomas Buschmann Arunkumar Byravan Serkan Cabi Ken Caluwaerts Federico Casarini Oscar Chang Jose Enriqu...

work page 2019
[5]

The user will provide a task instruction along with an initial image of the workspace area from the overhead camera, initial robot state and initial scene objects

Receive instruction. The user will provide a task instruction along with an initial image of the workspace area from the overhead camera, initial robot state and initial scene objects

work page
[6]

Mention where the objects are located on the table

Describe the scene. Mention where the objects are located on the table

work page
[7]

Think about the best approach to execute the task provided the object locations, object dimensions, robot embodiment constraints and direction guidelines provided below

Steps Planning. Think about the best approach to execute the task provided the object locations, object dimensions, robot embodiment constraints and direction guidelines provided below. Write down all of the steps you need to follow in detail to execute the task successfully with the robot. Each step should be as concise as possible and should contain a descriptio...

work page
[8]

After enumerating all the steps, write python code to execute each step for one step at a time on the robot using the API provided above

Steps Execution. After enumerating all the steps, write python code to execute each step for one step at a time on the robot using the API provided above. For each step:

work page
[9]

Rewrite a summary of the goal for the given step

work page
[10]

When grasping an object, follow the grasping guidelines provided below

work page
[11]

Describe your thought process

When moving a gripper to a specific position and orientation, make sure the target position is reachable according to the robot physical constraints described below and that there is enough clearance between other objects (including other gripper arms) to avoid collisions. Describe your thought process

work page
[12]

Write code to execute the given step on the robot using the api, this includes writing code to compute cartesian trajectories

work page
[13]

Summarize what the robot did as it executed the code based on the new image, robot state and initial scene objects as well as any execution error or user feedback

The code will be executed and you will be provided with a new image, the status of the execution and any error information that might have resulted from the code execution including anything printed to I/O. Summarize what the robot did as it executed the code based on the new image, robot state and initial scene objects as well as any execution error or u...

work page
[14]

If they align, continue with writing code

Compare your summary of what the robot did during code execution with the objective for that particular step. If they align, continue with writing code. If not, re-plan and write new steps to execute the task successfully. Consider the current state of the system when replanning (e.g., if a grasp failed the grippers may need to be reopened before attempti...

work page
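The execute, summarize, compare and re-plan cycle described in the prompt steps above can be sketched as a small loop. This is a minimal illustration only, not the paper's implementation: `run_steps`, `execute`, `achieved`, `replan` and `max_replans` are hypothetical names introduced here, and the callables stand in for the model and robot.

```python
from typing import Callable, List


def run_steps(
    steps: List[str],
    execute: Callable[[str], str],
    achieved: Callable[[str, str], bool],
    replan: Callable[[str, str, List[str]], List[str]],
    max_replans: int = 3,  # hypothetical safety limit, not from the source
) -> List[str]:
    """Run steps one at a time, re-planning when a step misses its goal."""
    queue = list(steps)
    completed: List[str] = []
    replans = 0
    while queue:
        step = queue.pop(0)
        # Execute the step and collect a summary of what happened.
        observation = execute(step)
        # Compare the summary with the objective of this step.
        if achieved(step, observation):
            completed.append(step)
            continue
        if replans >= max_replans:
            raise RuntimeError(f"giving up on step {step!r}")
        replans += 1
        # Re-plan from the current state; the result replaces the queue.
        queue = replan(step, observation, queue)
    return completed
```

A re-plan function would typically prepend recovery steps, for example reopening the gripper before retrying a failed grasp, as the prompt suggests.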
[15]

In the world frame, front/back is along the y axis, left/right is along the x axis and up/down is along the z axis with following directions: Positive x: Towards the right

Repeat steps 4.1-4.6 until you have completed all steps successfully. In the world frame, front/back is along the y axis, left/right is along the x axis and up/down is along the z axis with following directions: Positive x: Towards the right. Negative x: Towards the left. Positive y: Towards front of the t...

work page
[16]

Gripper has two parallel 0.09m fingers that can open up to 0.065m

work page
[17]

The center of the table belongs to the (0, 0, 0) coordinate in world frame

The table area is 0.80 meters wide (from left to right) and 0.40 meters long (from front to back). The center of the table belongs to the (0, 0, 0) coordinate in world frame

work page
[18]

The left arm can only reach the left side of the table which belongs to x coordinates greater than -0.40 meters but less than 0.1 meters

work page
[19]

Grasp Guidelines:

The right arm can only reach the right side of the table which belongs to x coordinates greater than -0.1 meters but less than 0.40 meters. Grasp Guidelines:

work page
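The table dimensions and per-arm reach limits quoted above can be turned into a simple workspace check. The following is a sketch under stated assumptions: the `Gripper` enum values come from the quoted API, while `on_table`, `arms_that_can_reach` and the choice to treat the table edge as inclusive are illustrative and not taken from the paper.

```python
from enum import Enum
from typing import List


class Gripper(Enum):
    LEFT = "left_gripper"
    RIGHT = "right_gripper"


TABLE_WIDTH_M = 0.80   # left to right, along x
TABLE_LENGTH_M = 0.40  # front to back, along y

# Open x intervals (in meters) each arm can reach, as stated in the prompt.
ARM_X_RANGE = {
    Gripper.LEFT: (-0.40, 0.10),
    Gripper.RIGHT: (-0.10, 0.40),
}


def on_table(x: float, y: float) -> bool:
    """True if (x, y) lies on the table, whose center is the world origin."""
    # Assumption: points exactly on the table edge count as on the table.
    return abs(x) <= TABLE_WIDTH_M / 2 and abs(y) <= TABLE_LENGTH_M / 2


def arms_that_can_reach(x: float, y: float) -> List[Gripper]:
    """Return the grippers whose stated x range contains the point."""
    if not on_table(x, y):
        return []
    return [g for g, (lo, hi) in ARM_X_RANGE.items() if lo < x < hi]
```

Under these limits only the band with x between -0.1 and 0.1 meters is reachable by both arms, which is where a handover between the arms would have to take place.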
[20]

This grasp pose must be used to compute a pre-grasp pose

Always use the get_grasp_position_and_euler_orientation function to get the grasp position and euler orientation for a specific object and gripper. This grasp pose must be used to compute a pre-grasp pose

work page
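One way to derive a pre-grasp pose from a grasp pose is to keep the orientation and raise the position by a fixed clearance. This sketch makes several assumptions that are not confirmed by the source: positive z points up, the 0.10 m default clearance is arbitrary, and `pre_grasp_pose` is a hypothetical helper. The grasp pose passed in would come from `get_grasp_position_and_euler_orientation`, whose exact signature is not shown here.

```python
from typing import Tuple

Vec3 = Tuple[float, float, float]


def pre_grasp_pose(
    grasp_position: Vec3,
    grasp_euler: Vec3,
    clearance_m: float = 0.10,  # hypothetical default, not from the source
) -> Tuple[Vec3, Vec3]:
    """Return a pose directly above the grasp pose with the same orientation."""
    if clearance_m < 0:
        raise ValueError("clearance_m must be non-negative")
    x, y, z = grasp_position
    # Assumption: +z is up in the world frame, so adding to z moves the
    # gripper above the object while keeping its approach orientation.
    return (x, y, z + clearance_m), grasp_euler
```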
[21]

If the arms are blocking the object, move the arms out of the way before attempting the grasp

Clear visibility: Make sure the robot arms are not blocking the visibility of the object. If the arms are blocking the object, move the arms out of the way before attempting the grasp

work page
[22]

Reachability: Ensuring the gripper can reach the desired grasp points on the object given its arm length and workspace limits

work page
[23]

Make sure the gripper is open before going to the grasp pose

work page
[24]

left_gripper

Successful grasp: A successful grasp will be reflected in the distance_between_fingers state of the robot. After closing the gripper the value of distance_between_fingers should be greater than 0 if the grippers are successfully enclosing the object. Robot API Interface Documentation: class Gripper(enum.Enum): LEFT = "left_gripper" RIGHT = "right_gripper" cl...

work page
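The grasp-success rule and the gripper opening limit stated in the prompt can be expressed as two small checks. This is an illustrative sketch: `fits_in_gripper` and `grasp_succeeded` are hypothetical helpers, and a real system would likely add a noise tolerance that the source does not specify.

```python
MAX_OPENING_M = 0.065  # maximum finger opening stated in the prompt


def fits_in_gripper(object_width_m: float) -> bool:
    """True if an object of this width can sit between the open fingers."""
    # Assumption: the object must be strictly narrower than the full opening.
    return 0.0 < object_width_m < MAX_OPENING_M


def grasp_succeeded(distance_between_fingers: float) -> bool:
    """Apply the prompt's rule: after closing, a positive finger distance
    means the fingers are enclosing an object rather than closing on air."""
    return distance_between_fingers > 0.0
```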
[25]

**Open both grippers:** Ensure both grippers are open to prepare for grasping

work page
[26]

* Calculate a pre-grasp position slightly above the banana

**Grasp the banana with the right gripper:** * Detect the banana's position and orientation. * Calculate a pre-grasp position slightly above the banana. * Move the right gripper to the pre-grasp position. * Move the right gripper to the grasp position. * Close the right gripper

work page
[27]

This position should be clear of the bowl

**Move the banana to a handover position:** If the right arm cannot reach the bowl, move the banana to a handover position closer to the left arm. This position should be clear of the bowl

work page
[28]

**Open the right gripper:** Release the banana at the handover position

work page
[29]

**Move the right gripper out of the way:** Move the right gripper to a safe position to allow the left gripper to grasp the banana

work page
[30]

* Calculate a pre-grasp position slightly above the banana

**Grasp the banana with the left gripper:** * Detect the banana's position and orientation. * Calculate a pre-grasp position slightly above the banana. * Move the left gripper to the pre-grasp position. * Move the left gripper to the grasp position. * Close the left gripper

work page
[31]

* Move the left gripper to a position above the bowl

**Move the banana over the bowl:** * Detect the bowl's position. * Move the left gripper to a position above the bowl

work page
[32]

**Release the banana into the bowl:** Open the left gripper to drop the banana into the bowl

work page
[33]

right_gripper

**Move the left gripper to a safe position:** Move the left gripper to a safe position. Figure 32 | Example of planning by Gemini whilst carrying out a robot control task. **Step 2: Grasp banana with left gripper** * **Goal:** Move the left gripper to the banana's grasp position and close the gripper. ...

work page
[34]

Put the top left green grapes into the right compartment of the grey box

adding typos, 2) translating the instruction to a different language (Spanish), 3) rephrasing the instruction, and 4) adding descriptive modifiers. See Figure 37 for detailed examples. Put the top left green grapes into the right compartment of the grey box. Put the top lft gren grapes into the rht comprtment of the grey bx. Coloque las uvas verdes de la ...

work page 2024