RobotValues is a benchmark of 10K value-conflict scenarios that reveals VLMs default to safety and accommodation while failing to follow instructions to prioritize other values 80% of the time.
hub Canonical reference
Gemini Robotics 1.5: Pushing the Frontier of Generalist Robots with Advanced Embodied Reasoning, Thinking, and Motion Transfer
Canonical reference. 80% of citing Pith papers cite this work as background.
abstract
General-purpose robots need a deep understanding of the physical world, advanced reasoning, and general and dexterous control. This report introduces the latest generation of the Gemini Robotics model family: Gemini Robotics 1.5, a multi-embodiment Vision-Language-Action (VLA) model, and Gemini Robotics-ER 1.5, a state-of-the-art Embodied Reasoning (ER) model. We are bringing together three major innovations. First, Gemini Robotics 1.5 features a novel architecture and a Motion Transfer (MT) mechanism, which enables it to learn from heterogeneous, multi-embodiment robot data and makes the VLA more general. Second, Gemini Robotics 1.5 interleaves actions with a multi-level internal reasoning process in natural language. This enables the robot to "think before acting" and notably improves its ability to decompose and execute complex, multi-step tasks, and also makes the robot's behavior more interpretable to the user. Third, Gemini Robotics-ER 1.5 establishes a new state-of-the-art for embodied reasoning, i.e., for reasoning capabilities that are critical for robots, such as visual and spatial understanding, task planning, and progress estimation. Together, this family of models takes us a step towards an era of physical agents-enabling robots to perceive, think and then act so they can solve complex multi-step tasks.
hub tools
citation-role summary
citation-polarity summary
years
2026 42representative citing papers
Molmo2 delivers state-of-the-art open-weight video VLMs with new grounding datasets and training methods that outperform prior open models and match or exceed some proprietary ones on pointing and tracking tasks.
Embodied.cpp introduces a portable C++ inference runtime with modular layers for deploying VLA and WAM models on heterogeneous robots, reporting 100% and 91% task success on two models plus memory reduction on a WAM benchmark.
FTP-1 is the first foundation tactile policy pretrained on ~3000 hours of data from 26 sources across 21 sensors that improves performance on seen setups by 17.2% and transfers to unseen sensors with 31% success rate gain.
Introduces Neighbor Leakage Rate showing high trigger leakage in VLAS backdoors at 3% poisoning, caused by broad activation regions in fine-tuning that hard-negative samples can narrow.
Introduces pause-and-think-T dataset and pause-and-think-B benchmark; fine-tunes 4B VLM to 58% accuracy matching 235B model while generalizing out-of-distribution.
The paper introduces SP-VTP as a new setting for egocentric manipulation, releases the EgoSPT dataset with first-frame spatial annotations, and proposes the SPOT model that outperforms non-prompted baselines on cross-scene trajectory prediction.
A liveness-based Bellman operator enables conservative offline policy evaluation for manipulation tasks by encoding task progression and reducing truncation bias from finite horizons.
MemCompiler reframes memory use as state-conditioned compilation, delivering relevant guidance via text and latent channels to improve embodied agent performance up to 129% and cut latency 60% versus static injection.
KinDER is a new open-source benchmark that demonstrates substantial gaps in current robot learning and planning methods for handling physical constraints.
PhysMem enables VLM-based robot planners to learn and verify physical properties through test-time interaction and hypothesis testing, raising success on a brick insertion task from 23% to 76%.
RL-VLA³ is an asynchronous RL framework for VLA training that delivers up to 85.2% higher throughput than synchronous baselines while preserving identical sample efficiency and scaling to 256 GPUs.
SPARK reaches 43.7% success on six LIBERO-PRO cells by LLM-generated typed behavior trees plus multi-prompt perception and recovery, more than doubling CaP-Agent0 and VLA baselines.
ZR-0 is a dual-stream VLA model trained with dense ECoT supervision on 60M frames from 400K trajectories to enable cross-embodiment transfer in simulation and real-world settings.
Vesta is a unified embodied generalist model that outperforms specialist baselines by over 20% on average and improves real-world robotic task success by over 35%.
VLA models exhibit layer-wise redundancy allowing up to 50% depth compression via training-free CKA-based removal, yielding faster fine-tuning and inference with no performance loss on robot tasks.
RATs agents generate and solve their own exploratory tasks during play, distill successful code into a skill library, and reuse it to improve held-out task performance by 20.6 and 17.0 points on two benchmarks.
GAM splits a geometric foundation model to enable language-conditioned future geometry prediction and action decoding for robot policies, claiming superior performance on manipulation benchmarks.
DIRECT is a multimodal-context router that allocates test-time compute across chain-of-thought depth, model size, and memory history for VLM embodied planners, improving the success-cost Pareto frontier and matching stronger models at up to 65% lower latency on benchmarks and a physical Franka arm.
A systematic study of hierarchical VLA agents identifies design principles that improve robot manipulation performance over flat and naive hierarchical baselines in simulation and real-world experiments.
ERVLA trains on a 978k-trajectory embodied CoT corpus using reasoning as supervision with dropout, then predicts actions without CoT at test time, reaching 86.9% on LIBERO-Plus and 53.2% on VLABench.
SafePBDS uses pullback control barrier functions and a task manifold action interface to generate certifiably safe, steerable motions on high-DOF robots from objectives defined on arbitrary geometric spaces.
VLMs achieve 53-97% on rearrangement planning but only 6-45% on occlusion and under 7% on reflections, with failures localized to visual token compression after the vision encoder.
D-VLA uses plane decoupling and a swimlane pipeline to deliver higher throughput and linear speedup than prior RL frameworks when training billion- and trillion-parameter VLA models on benchmarks like LIBERO.
citing papers explorer
-
Beyond Attack Success Rate: Examining Trigger Leakage in Vision-Language Agentic Systems
Introduces Neighbor Leakage Rate showing high trigger leakage in VLAS backdoors at 3% poisoning, caused by broad activation regions in fine-tuning that hard-negative samples can narrow.