hub Baseline reference

LIBERO: Benchmarking Knowledge Transfer for Lifelong Robot Learning

Bo Liu, Yifeng Zhu, Chongkai Gao, Yihao Feng, Qiang Liu, Yuke Zhu · 2023 · cs.AI · arXiv 2306.03310

Baseline reference. 59% of citing Pith papers use this work as a benchmark or comparison.

69 Pith papers citing it

Baseline 59% of classified citations

open full Pith review browse 69 citing papers arXiv PDF

abstract

Lifelong learning offers a promising paradigm of building a generalist agent that learns and adapts over its lifespan. Unlike traditional lifelong learning problems in image and text domains, which primarily involve the transfer of declarative knowledge of entities and concepts, lifelong learning in decision-making (LLDM) also necessitates the transfer of procedural knowledge, such as actions and behaviors. To advance research in LLDM, we introduce LIBERO, a novel benchmark of lifelong learning for robot manipulation. Specifically, LIBERO highlights five key research topics in LLDM: 1) how to efficiently transfer declarative knowledge, procedural knowledge, or the mixture of both; 2) how to design effective policy architectures and 3) effective algorithms for LLDM; 4) the robustness of a lifelong learner with respect to task ordering; and 5) the effect of model pretraining for LLDM. We develop an extendible procedural generation pipeline that can in principle generate infinitely many tasks. For benchmarking purpose, we create four task suites (130 tasks in total) that we use to investigate the above-mentioned research topics. To support sample-efficient learning, we provide high-quality human-teleoperated demonstration data for all tasks. Our extensive experiments present several insightful or even unexpected discoveries: sequential finetuning outperforms existing lifelong learning methods in forward transfer, no single visual encoder architecture excels at all types of knowledge transfer, and naive supervised pretraining can hinder agents' performance in the subsequent LLDM. Check the website at https://libero-project.github.io for the code and the datasets.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

dataset 13 background 5 baseline 2 method 2

citation-polarity summary

use dataset 11 background 7 baseline 2 use method 2

representative citing papers

Denoising Tells When to Replan: Denoising-Variance Adaptive Chunking for Flow-Based Robot Policies

cs.RO · 2026-06-02 · unverdicted · novelty 7.0

DVAC uses denoising variance as an intrinsic signal to adaptively chunk actions in flow-based robot policies, improving success rates and cutting replans on LIBERO, RoboTwin, CALVIN, and real-world tasks.

BOKBO (Best of K Bad Options): Calibrated Abstention for VLA Policies

cs.LG · 2026-05-28 · unverdicted · novelty 7.0

BOKBO is the first conformal abstention method for K-sample VLA policies that supplies finite-sample distribution-free guarantees on executed violation rates, with global and Mondrian per-task variants.

Learn Where Outcomes Diverge: Efficient VLA RL via Probabilistic Chunk Masking

cs.LG · 2026-05-15 · unverdicted · novelty 7.0

PCM uses success-failure action variance to probabilistically select and mask chunks for gradient updates in GRPO, matching standard success rates with 2.38x wall-clock speedup and 60% lower memory on LIBERO benchmarks.

Offline Policy Evaluation for Manipulation Policies via Discounted Liveness Formulation

cs.RO · 2026-05-12 · conditional · novelty 7.0

A liveness-based Bellman operator enables conservative offline policy evaluation for manipulation tasks by encoding task progression and reducing truncation bias from finite horizons.

Overcoming Dynamics-Blindness: Training-Free Pace-and-Path Correction for VLA Models

cs.RO · 2026-05-12 · unverdicted · novelty 7.0 · 2 refs

Pace-and-Path Correction decomposes a quadratic cost minimization into orthogonal pace and path channels to correct chunked actions in VLA models, raising success rates by up to 28.8% in dynamic settings.

One Token Per Frame: Reconsidering Visual Bandwidth in World Models for VLA Policy

cs.CV · 2026-05-08 · conditional · novelty 7.0 · 3 refs

Reducing visual input to one token per frame in VLA world models maintains or improves long-horizon performance on MetaWorld, LIBERO, and real-robot tasks.

OA-WAM: Object-Addressable World Action Model for Robust Robot Manipulation

cs.RO · 2026-05-07 · unverdicted · novelty 7.0

OA-WAM uses persistent address vectors and dynamic content vectors in object slots to enable addressable world-action prediction, improving robustness on manipulation benchmarks under scene changes.

Atomic-Probe Governance for Skill Updates in Compositional Robot Policies

cs.RO · 2026-04-29 · unverdicted · novelty 7.0 · 2 refs

A cross-version swap protocol reveals dominant skills that swing composition success by up to 50 percentage points, and an atomic probe with selective revalidation governs updates at lower cost than always re-testing full compositions.

Privileged Foresight Distillation: Zero-Cost Future Correction for World Action Models

cs.RO · 2026-04-28 · unverdicted · novelty 7.0

Privileged Foresight Distillation distills the residual difference in action predictions with versus without future context into a current-only adapter, yielding consistent gains on LIBERO and RoboTwin benchmarks.

CF-VLA: Efficient Coarse-to-Fine Action Generation for Vision-Language-Action Policies

cs.CV · 2026-04-27 · unverdicted · novelty 7.0

CF-VLA uses a coarse initialization over endpoint velocity followed by single-step refinement to achieve strong performance with low inference steps on CALVIN, LIBERO, and real-robot tasks.

Characterizing Vision-Language-Action Models across XPUs: Constraints and Acceleration for On-Robot Deployment

cs.RO · 2026-04-27 · unverdicted · novelty 7.0

VLA models exhibit a compute-bound VLM phase followed by a memory-bound action phase on edge hardware; DP-Cache and V-AEFusion reduce redundancy and enable pipeline parallelism for up to 6x speedup on NPUs with marginal task degradation.

Mask World Model: Predicting What Matters for Robust Robot Policy Learning

cs.RO · 2026-04-21 · unverdicted · novelty 7.0

Mask World Model predicts semantic mask dynamics with video diffusion and integrates it with a diffusion policy head, outperforming RGB world models on LIBERO and RLBench while showing better real-world generalization and texture robustness.

STRONG-VLA: Decoupled Robustness Learning for Vision-Language-Action Models under Multimodal Perturbations

cs.RO · 2026-04-11 · unverdicted · novelty 7.0

STRONG-VLA uses decoupled two-stage training to improve VLA model robustness, yielding up to 16% higher task success rates under seen and unseen perturbations on the LIBERO benchmark.

VP-VLA: Visual Prompting as an Interface for Vision-Language-Action Models

cs.RO · 2026-03-23 · unverdicted · novelty 7.0

VP-VLA decouples high-level reasoning from low-level control in VLA models by rendering spatial anchors as visual prompts directly in the RGB observation space, outperforming end-to-end baselines.

Steering Your Diffusion Policy with Latent Space Reinforcement Learning

cs.RO · 2025-06-18 · unverdicted · novelty 7.0

DSRL steers pretrained diffusion policies for robotics by applying RL to their latent noise inputs, achieving sample-efficient real-world adaptation with only black-box access.

Sequential Planning via Anchored Robotic Keypoints

cs.RO · 2026-06-29 · unverdicted · novelty 6.0

SPARK reaches 43.7% success on six LIBERO-PRO cells by LLM-generated typed behavior trees plus multi-prompt perception and recovery, more than doubling CaP-Agent0 and VLA baselines.

The Speedup Paradox: Rethinking Inference Speed-Quality Trade-off in Embodied Tasks

cs.RO · 2026-06-26 · unverdicted · novelty 6.0 · 2 refs

TISED decomposes inference optimization effects on embodied tasks and identifies paradoxical outcomes where faster per-step inference can increase task completion time on static tasks or raise success rates on dynamic tasks.

SSI-Policy: Learning Structured Scene Interfaces for Vision-Language Robotic Manipulation

cs.RO · 2026-06-25 · unverdicted · novelty 6.0

SSI-Policy uses an RGB-only Structured Scene Interface to improve LIBERO benchmark performance by nearly 15% with only 10 demonstrations per task compared to prior methods.

Flash-WAM: Modality-Aware Distillation for World Action Models

cs.LG · 2026-06-03 · unverdicted · novelty 6.0

Flash-WAM introduces modality-specific consistency parametrizations to distill joint video-action diffusion models to single-step inference, delivering 23x speedup with preserved benchmark performance.

Revisiting Embodied Chain-of-Thought for Generalizable Robot Manipulation

cs.RO · 2026-06-02 · unverdicted · novelty 6.0

ERVLA trains on a 978k-trajectory embodied CoT corpus using reasoning as supervision with dropout, then predicts actions without CoT at test time, reaching 86.9% on LIBERO-Plus and 53.2% on VLABench.

See Less, Specify More: Visual Evidence Budgets for Generalizable VLAs

cs.RO · 2026-06-01 · unverdicted · novelty 6.0

S2 improves generalization in vision-language-action models by using goal-preserving refined language guidance and explicit visual evidence budgets, raising mean subtask success from 54.2% to 79.0% on eight real-robot tasks compared to pi0.5.

RoboSemanticBench: Diagnosing Semantic Grounding in Action Prediction for VLA Models

cs.RO · 2026-06-01 · unverdicted · novelty 6.0

RoboSemanticBench reveals that representative VLA models grasp blocks successfully but select the semantically correct answer at near-random rates, indicating a gap between backbone semantics and action prediction.

Lngram: N-gram Conditional Memory in Latent Space

cs.CL · 2026-05-24 · unverdicted · novelty 6.0

Lngram is a latent N-gram conditional memory module that learns discrete symbols from hidden states for N-gram lookup, outperforming baselines in language modeling and multimodal tasks.

COBALT: Crowdsourcing Robot Learning via Cloud-Based Teleoperation with Smartphones

cs.RO · 2026-05-18 · conditional · novelty 6.0 · 2 refs

COBALT enables scalable crowdsourced teleoperation of robots using smartphones, supporting concurrent users with low latency and yielding a 7500+ demonstration dataset validated on imitation learning tasks.

citing papers explorer

Showing 50 of 69 citing papers.

Denoising Tells When to Replan: Denoising-Variance Adaptive Chunking for Flow-Based Robot Policies cs.RO · 2026-06-02 · unverdicted · none · ref 21 · internal anchor
DVAC uses denoising variance as an intrinsic signal to adaptively chunk actions in flow-based robot policies, improving success rates and cutting replans on LIBERO, RoboTwin, CALVIN, and real-world tasks.
BOKBO (Best of K Bad Options): Calibrated Abstention for VLA Policies cs.LG · 2026-05-28 · unverdicted · none · ref 12 · internal anchor
BOKBO is the first conformal abstention method for K-sample VLA policies that supplies finite-sample distribution-free guarantees on executed violation rates, with global and Mondrian per-task variants.
Learn Where Outcomes Diverge: Efficient VLA RL via Probabilistic Chunk Masking cs.LG · 2026-05-15 · unverdicted · none · ref 16 · internal anchor
PCM uses success-failure action variance to probabilistically select and mask chunks for gradient updates in GRPO, matching standard success rates with 2.38x wall-clock speedup and 60% lower memory on LIBERO benchmarks.
Offline Policy Evaluation for Manipulation Policies via Discounted Liveness Formulation cs.RO · 2026-05-12 · conditional · none · ref 14 · internal anchor
A liveness-based Bellman operator enables conservative offline policy evaluation for manipulation tasks by encoding task progression and reducing truncation bias from finite horizons.
Overcoming Dynamics-Blindness: Training-Free Pace-and-Path Correction for VLA Models cs.RO · 2026-05-12 · unverdicted · none · ref 13 · 2 links · internal anchor
Pace-and-Path Correction decomposes a quadratic cost minimization into orthogonal pace and path channels to correct chunked actions in VLA models, raising success rates by up to 28.8% in dynamic settings.
One Token Per Frame: Reconsidering Visual Bandwidth in World Models for VLA Policy cs.CV · 2026-05-08 · conditional · none · ref 32 · 3 links · internal anchor
Reducing visual input to one token per frame in VLA world models maintains or improves long-horizon performance on MetaWorld, LIBERO, and real-robot tasks.
OA-WAM: Object-Addressable World Action Model for Robust Robot Manipulation cs.RO · 2026-05-07 · unverdicted · none · ref 45 · internal anchor
OA-WAM uses persistent address vectors and dynamic content vectors in object slots to enable addressable world-action prediction, improving robustness on manipulation benchmarks under scene changes.
Atomic-Probe Governance for Skill Updates in Compositional Robot Policies cs.RO · 2026-04-29 · unverdicted · none · ref 33 · 2 links · internal anchor
A cross-version swap protocol reveals dominant skills that swing composition success by up to 50 percentage points, and an atomic probe with selective revalidation governs updates at lower cost than always re-testing full compositions.
Privileged Foresight Distillation: Zero-Cost Future Correction for World Action Models cs.RO · 2026-04-28 · unverdicted · none · ref 10 · internal anchor
Privileged Foresight Distillation distills the residual difference in action predictions with versus without future context into a current-only adapter, yielding consistent gains on LIBERO and RoboTwin benchmarks.
CF-VLA: Efficient Coarse-to-Fine Action Generation for Vision-Language-Action Policies cs.CV · 2026-04-27 · unverdicted · none · ref 26 · internal anchor
CF-VLA uses a coarse initialization over endpoint velocity followed by single-step refinement to achieve strong performance with low inference steps on CALVIN, LIBERO, and real-robot tasks.
Characterizing Vision-Language-Action Models across XPUs: Constraints and Acceleration for On-Robot Deployment cs.RO · 2026-04-27 · unverdicted · none · ref 11 · internal anchor
VLA models exhibit a compute-bound VLM phase followed by a memory-bound action phase on edge hardware; DP-Cache and V-AEFusion reduce redundancy and enable pipeline parallelism for up to 6x speedup on NPUs with marginal task degradation.
Mask World Model: Predicting What Matters for Robust Robot Policy Learning cs.RO · 2026-04-21 · unverdicted · none · ref 25 · internal anchor
Mask World Model predicts semantic mask dynamics with video diffusion and integrates it with a diffusion policy head, outperforming RGB world models on LIBERO and RLBench while showing better real-world generalization and texture robustness.
STRONG-VLA: Decoupled Robustness Learning for Vision-Language-Action Models under Multimodal Perturbations cs.RO · 2026-04-11 · unverdicted · none · ref 10 · internal anchor
STRONG-VLA uses decoupled two-stage training to improve VLA model robustness, yielding up to 16% higher task success rates under seen and unseen perturbations on the LIBERO benchmark.
VP-VLA: Visual Prompting as an Interface for Vision-Language-Action Models cs.RO · 2026-03-23 · unverdicted · none · ref 25 · internal anchor
VP-VLA decouples high-level reasoning from low-level control in VLA models by rendering spatial anchors as visual prompts directly in the RGB observation space, outperforming end-to-end baselines.
Steering Your Diffusion Policy with Latent Space Reinforcement Learning cs.RO · 2025-06-18 · unverdicted · none · ref 95 · internal anchor
DSRL steers pretrained diffusion policies for robotics by applying RL to their latent noise inputs, achieving sample-efficient real-world adaptation with only black-box access.
Sequential Planning via Anchored Robotic Keypoints cs.RO · 2026-06-29 · unverdicted · none · ref 6 · internal anchor
SPARK reaches 43.7% success on six LIBERO-PRO cells by LLM-generated typed behavior trees plus multi-prompt perception and recovery, more than doubling CaP-Agent0 and VLA baselines.
The Speedup Paradox: Rethinking Inference Speed-Quality Trade-off in Embodied Tasks cs.RO · 2026-06-26 · unverdicted · none · ref 54 · 2 links · internal anchor
TISED decomposes inference optimization effects on embodied tasks and identifies paradoxical outcomes where faster per-step inference can increase task completion time on static tasks or raise success rates on dynamic tasks.
SSI-Policy: Learning Structured Scene Interfaces for Vision-Language Robotic Manipulation cs.RO · 2026-06-25 · unverdicted · none · ref 12 · internal anchor
SSI-Policy uses an RGB-only Structured Scene Interface to improve LIBERO benchmark performance by nearly 15% with only 10 demonstrations per task compared to prior methods.
Flash-WAM: Modality-Aware Distillation for World Action Models cs.LG · 2026-06-03 · unverdicted · none · ref 21 · internal anchor
Flash-WAM introduces modality-specific consistency parametrizations to distill joint video-action diffusion models to single-step inference, delivering 23x speedup with preserved benchmark performance.
Revisiting Embodied Chain-of-Thought for Generalizable Robot Manipulation cs.RO · 2026-06-02 · unverdicted · none · ref 31 · internal anchor
ERVLA trains on a 978k-trajectory embodied CoT corpus using reasoning as supervision with dropout, then predicts actions without CoT at test time, reaching 86.9% on LIBERO-Plus and 53.2% on VLABench.
See Less, Specify More: Visual Evidence Budgets for Generalizable VLAs cs.RO · 2026-06-01 · unverdicted · none · ref 14 · internal anchor
S2 improves generalization in vision-language-action models by using goal-preserving refined language guidance and explicit visual evidence budgets, raising mean subtask success from 54.2% to 79.0% on eight real-robot tasks compared to pi0.5.
RoboSemanticBench: Diagnosing Semantic Grounding in Action Prediction for VLA Models cs.RO · 2026-06-01 · unverdicted · none · ref 22 · internal anchor
RoboSemanticBench reveals that representative VLA models grasp blocks successfully but select the semantically correct answer at near-random rates, indicating a gap between backbone semantics and action prediction.
Lngram: N-gram Conditional Memory in Latent Space cs.CL · 2026-05-24 · unverdicted · none · ref 8 · internal anchor
Lngram is a latent N-gram conditional memory module that learns discrete symbols from hidden states for N-gram lookup, outperforming baselines in language modeling and multimodal tasks.
COBALT: Crowdsourcing Robot Learning via Cloud-Based Teleoperation with Smartphones cs.RO · 2026-05-18 · conditional · none · ref 21 · 2 links · internal anchor
COBALT enables scalable crowdsourced teleoperation of robots using smartphones, supporting concurrent users with low latency and yielding a 7500+ demonstration dataset validated on imitation learning tasks.
DexHoldem: Playing Texas Hold'em with Dexterous Embodied System cs.RO · 2026-05-18 · unverdicted · none · ref 33 · internal anchor
DexHoldem is a new benchmark providing 1,470 teleoperated demonstrations across 14 manipulation primitives, plus standardized tests for dexterous policy execution and agentic perception in a physical Texas Hold'em setting.
Offline Semantic Guidance for Efficient Vision-Language-Action Policy Distillation cs.CV · 2026-05-15 · conditional · none · ref 15 · internal anchor
VLA-AD distills 7B VLA teachers into 158M students using offline VLM semantic guidance on task phases and directions, matching teacher performance on LIBERO with 44x size reduction and 3.28x speedup.
FrameSkip: Learning from Fewer but More Informative Frames in VLA Training cs.RO · 2026-05-13 · unverdicted · none · ref 10 · internal anchor
FrameSkip improves VLA policy training success from 66.50% to 76.15% by selecting high-importance frames and retaining only 20% of unique frames across three benchmarks.
Towards Long-horizon Embodied Agents with Tool-Aligned Vision-Language-Action Models cs.RO · 2026-05-13 · unverdicted · none · ref 16 · internal anchor
VLAs-as-Tools pairs a VLM planner with specialized VLA executors via a new interface and Tool-Aligned Post-Training to raise long-horizon robot success rates on LIBERO-Long and RoboTwin benchmarks.
TMRL: Diffusion Timestep-Modulated Pretraining Enables Exploration for Efficient Policy Finetuning cs.RO · 2026-05-12 · unverdicted · none · ref 53 · internal anchor
TMRL bridges behavioral cloning pretraining and RL finetuning via diffusion noise and timestep modulation to enable controlled exploration, improving sample efficiency and enabling real-world robot training in under one hour.
ALAM: Algebraically Consistent Latent Action Model for Vision-Language-Action Models cs.RO · 2026-05-11 · unverdicted · none · ref 31 · 2 links · internal anchor
ALAM introduces algebraic consistency regularization on latent action transitions from videos, raising VLA success rates from 47.9% to 85.0% on MetaWorld MT50 and 94.1% to 98.1% on LIBERO.
BEACON: Cross-Domain Co-Training of Generative Robot Policies via Best-Effort Adaptation cs.RO · 2026-05-09 · unverdicted · none · ref 32 · 2 links · internal anchor
BEACON uses discrepancy-aware importance reweighting to jointly train diffusion-based robot policies and source sample weights, improving performance over target-only and fixed-ratio baselines in cross-domain manipulation tasks.
Predictive but Not Plannable: RC-aux for Latent World Models cs.LG · 2026-05-08 · unverdicted · none · ref 28 · internal anchor
RC-aux corrects spatiotemporal mismatch in reconstruction-free latent world models by adding multi-horizon prediction and reachability supervision, improving planning performance on goal-conditioned pixel-control tasks.
Toward Visually Realistic Simulation: A Benchmark for Evaluating Robot Manipulation in Simulation cs.RO · 2026-05-07 · unverdicted · none · ref 22 · internal anchor
VISER is a new visually realistic simulation benchmark for robot manipulation tasks that uses PBR materials and MLLM-assisted asset generation, achieving 0.92 Pearson correlation with real-world policy performance.
ConsisVLA-4D: Advancing Spatiotemporal Consistency in Efficient 3D-Perception and 4D-Reasoning for Robotic Manipulation cs.RO · 2026-05-06 · unverdicted · none · ref 47 · internal anchor
ConsisVLA-4D adds cross-view semantic alignment, cross-object geometric fusion, and cross-scene dynamic reasoning to VLA models, delivering 21.6% and 41.5% gains plus 2.3x and 2.4x speedups on LIBERO and real-world tasks.
PRTS: A Primitive Reasoning and Tasking System via Contrastive Representations cs.AI · 2026-04-30 · unverdicted · none · ref 20 · internal anchor
PRTS pretrains VLA models with contrastive goal-conditioned RL to embed goal-reachability probabilities from offline data, yielding SOTA results on robotic benchmarks especially for long-horizon and novel instructions.
CorridorVLA: Explicit Spatial Constraints for Generative Action Heads via Sparse Anchors cs.RO · 2026-04-23 · unverdicted · none · ref 17 · internal anchor
CorridorVLA improves VLA models by using predicted sparse anchors to impose explicit spatial corridors on action trajectories, yielding 3.4-12.4% success rate gains on LIBERO-Plus with GR00T-Corr reaching 83.21%.
Grounded World Model for Semantically Generalizable Planning cs.RO · 2026-04-13 · conditional · none · ref 43 · internal anchor
A vision-language-aligned world model turns visuomotor MPC into a language-following planner that reaches 87% success on 288 unseen semantic tasks where standard VLAs drop to 22%.
RoboPlayground: Democratizing Robotic Evaluation through Structured Physical Domains cs.RO · 2026-04-06 · unverdicted · none · ref 22 · internal anchor
RoboPlayground reframes robotic manipulation evaluation as a language-driven process over structured physical domains, letting users author varied yet reproducible tasks that reveal policy generalization failures.
Gaze-Regularized Vision-Language-Action Models for Robotic Manipulation cs.CV · 2026-03-24 · unverdicted · none · ref 29 · internal anchor
Gaze regularization aligns VLA attention with human visual patterns via KL divergence on patch distributions, yielding 4-12% gains on manipulation benchmarks.
Fast-WAM: Do World Action Models Need Test-time Future Imagination? cs.CV · 2026-03-17 · unverdicted · none · ref 40 · internal anchor
Fast-WAM shows that explicit future imagination at test time is not required for strong WAM performance; video modeling during training provides the main benefit.
Robometer: Scaling General-Purpose Robotic Reward Models via Trajectory Comparisons cs.RO · 2026-03-02 · unverdicted · none · ref 66 · internal anchor
Robometer combines intra-trajectory progress supervision with inter-trajectory preference supervision on a 1M-trajectory dataset to learn more generalizable robotic reward functions than prior methods.
Steerable Vision-Language-Action Policies for Embodied Reasoning and Hierarchical Control cs.RO · 2026-02-13 · unverdicted · none · ref 74 · internal anchor
Steerable VLAs trained on rich synthetic commands at subtask, motion, and pixel levels enable VLMs to steer robot behavior more effectively, outperforming prior hierarchical baselines on real-world manipulation and generalization tasks.
ABot-M0: VLA Foundation Model for Robotic Manipulation with Action Manifold Learning cs.CV · 2026-02-11 · unverdicted · none · ref 29 · internal anchor
ABot-M0 unifies heterogeneous robot data into a 6-million-trajectory dataset and introduces Action Manifold Learning to predict stable actions on a low-dimensional manifold using a DiT backbone.
Flexible Multitask Learning with Factorized Diffusion Policy cs.RO · 2025-12-26 · unverdicted · none · ref 27 · internal anchor
A factorized modular diffusion policy improves fitting of multimodal robot actions and enables flexible task adaptation without catastrophic forgetting.
mimic-video: Video-Action Models for Generalizable Robot Control Beyond VLAs cs.RO · 2025-12-17 · unverdicted · none · ref 34 · internal anchor
mimic-video combines internet video pretraining with a flow-matching decoder to achieve state-of-the-art robotic manipulation performance with 10x better sample efficiency than vision-language-action models.
DeepThinkVLA: Enhancing Reasoning Capability of Vision-Language-Action Models cs.LG · 2025-10-31 · unverdicted · none · ref 23 · internal anchor
DeepThinkVLA shows CoT improves VLA models only under decoding and causal alignment, delivering 97% success on LIBERO and 21.7-point gains via hybrid attention and SFT-RL training.
Co-Evolving Latent Action World Models cs.LG · 2025-10-30 · unverdicted · none · ref 23 · internal anchor
CoLA-World jointly trains latent action models and world models with a warm-up phase to achieve co-evolution, matching or exceeding prior two-stage methods in video simulation quality and visual planning performance.
villa-X: Enhancing Latent Action Modeling in Vision-Language-Action Models cs.RO · 2025-07-31 · unverdicted · none · ref 40 · internal anchor
villa-X enhances latent action modeling in VLA models to support zero-shot action planning for unseen robot embodiments and open-vocabulary instructions, yielding better manipulation results in simulation and real-world tests.
CoT-VLA: Visual Chain-of-Thought Reasoning for Vision-Language-Action Models cs.CV · 2025-03-27 · unverdicted · none · ref 38 · internal anchor
CoT-VLA is a 7B VLA that generates future visual frames autoregressively as planning goals before actions, outperforming prior VLAs by 17% on real-world tasks and 6% in simulation.
Cosmos-Reason1: From Physical Common Sense To Embodied Reasoning cs.AI · 2025-03-18 · conditional · none · ref 32 · internal anchor
Cosmos-Reason1-7B and 56B models are trained with physical common sense and embodied reasoning ontologies via supervised fine-tuning and reinforcement learning to produce next-step physical actions.

LIBERO: Benchmarking Knowledge Transfer for Lifelong Robot Learning

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer