hub Mixed citations

X-VLA: Soft-Prompted Transformer as Scalable Cross-Embodiment Vision-Language-Action Model

Jinliang Zheng, Jianxiong Li, Zhihao Wang, Dongxiu Liu, Xirui Kang, Yuchun Feng · 2025 · cs.RO · arXiv 2510.10274

Mixed citation behavior. Most common role is background (60%).

99 Pith papers citing it

Background 60% of classified citations

open full Pith review browse 99 citing papers arXiv PDF

abstract

Successful generalist Vision-Language-Action (VLA) models rely on effective training across diverse robotic platforms with large-scale, cross-embodiment, heterogeneous datasets. To facilitate and leverage the heterogeneity in rich, diverse robotic data sources, we propose a novel Soft Prompt approach with minimally added parameters, by infusing prompt learning concepts into cross-embodiment robot learning and introducing separate sets of learnable embeddings for each distinct data source. These embeddings serve as embodiment-specific prompts, which in unity empower VLA models with effective exploitation of varying cross-embodiment features. Our new X-VLA, a neat flow-matching-based VLA architecture, relies exclusively on soft-prompted standard Transformer encoders, enjoying both scalability and simplicity. Evaluated across 6 simulations as well as 3 real-world robots, our 0.9B instantiation-X-VLA-0.9B simultaneously achieves SOTA performance over a sweep of benchmarks, demonstrating superior results on a wide axes of capabilities, from flexible dexterity to quick adaptation across embodiments, environments, and tasks. Website: https://thu-air-dream.github.io/X-VLA/

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 13 baseline 7

citation-polarity summary

background 12 baseline 7 unclear 1

representative citing papers

LIBERO-Safety: A Comprehensive Benchmark for Physical and Semantic Safety in Vision-Language-Action Models

cs.RO · 2026-06-22 · unverdicted · novelty 7.0 · 2 refs

LIBERO-Safety supplies a scalable benchmark, data-generation pipeline, and 19,664-demonstration dataset that exposes a generalization-safety tension in current VLA models where diverse training improves collision avoidance but task success stays limited by trajectory quality and semantic understandi

Cloak: Zero-Shot Cross-Embodiment Manipulation by Masking the End-Effector from the VLA

cs.RO · 2026-06-22 · unverdicted · novelty 7.0

Masking the end-effector from wrist views during training lets a single-gripper VLA transfer zero-shot to other grippers, arms, and five-fingered hands while keeping original performance.

Frequency-Aware Flow Matching for Continuous and Consistent Robotic Action Generation

cs.RO · 2026-06-18 · unverdicted · novelty 7.0

FAFM performs flow matching in the frequency domain using DCT on action sequences to produce continuous temporally consistent robotic actions with a Sobolev-style smoothness regularizer.

EBench: Elemental Diagnosis of Generalist Mobile Manipulation Policies

cs.RO · 2026-06-16 · unverdicted · novelty 7.0

EBench is a benchmark that evaluates generalist mobile manipulation policies on 26 tasks across 5 capability and 4 generalization dimensions, revealing distinct capability profiles among models with similar success rates.

ThinkingVLA: Interleaved Vision and Language Reasoning for Robotic Manipulation

cs.RO · 2026-06-16 · unverdicted · novelty 7.0

ThinkingVLA is a Mixture-of-Transformers VLA model that performs interleaved forward CoT for subgoal and image prediction followed by inverse CoT grounded on the predicted image to generate actions.

DuoBench: A Reproducible Benchmark for Bimanual Manipulation in Simulation and the Real World

cs.RO · 2026-06-10 · unverdicted · novelty 7.0

DuoBench introduces eleven bimanual manipulation tasks with stage-based evaluation and human datasets to benchmark imitation-learning and vision-language-action policies on dual-arm robots in sim and real settings.

X-Tokenizer: A Multimodal Action Tokenizer for Vision-Language-Action Pretraining

cs.CV · 2026-06-07 · unverdicted · novelty 7.0

X-Tokenizer creates semantic action tokens via asymmetric residual quantization and contrastive pretraining on large trajectory data, outperforming prior methods like FAST on robotic tasks.

ActionMap: Robot Policy Learning via Voxel Action Heatmap

cs.RO · 2026-06-05 · unverdicted · novelty 7.0

ActionMap introduces a voxel heatmap action head for VLA models that improves policy learning by exploiting geometric structure in the action space.

Can VLA Models Learn from Real-World Data Continually without Forgetting?

cs.RO · 2026-05-26 · unverdicted · novelty 7.0

VLA models exhibit catastrophic forgetting on a new real-world dataset of four sequential manipulation tasks, with experience replay implementation factors evaluated for mitigation.

RotVLA: Rotational Latent Action for Vision-Language-Action Model

cs.RO · 2026-05-13 · unverdicted · novelty 7.0

RotVLA models latent actions as continuous SO(n) rotations with triplet-frame supervision and flow-matching to reach 98.2% success on LIBERO and 89.6%/88.5% on RoboTwin2.0 using a 1.7B-parameter model.

See What Matters: Differentiable Grid Sample Pruning for Generalizable Vision-Language-Action Model

cs.RO · 2026-05-12 · conditional · novelty 7.0 · 2 refs

GridS is a plug-and-play differentiable module for geometry-aware visual token resampling in VLA models that achieves under 10% token retention and 76% FLOPs reduction with no success-rate loss.

RIO: Flexible Real-Time Robot I/O for Cross-Embodiment Robot Learning

cs.RO · 2026-05-12 · unverdicted · novelty 7.0

RIO introduces a lightweight open-source framework that abstracts real-time robot I/O to support easy switching between embodiments and platforms for collecting data and deploying VLAs.

One Token Per Frame: Reconsidering Visual Bandwidth in World Models for VLA Policy

cs.CV · 2026-05-08 · conditional · novelty 7.0 · 3 refs

Reducing visual input to one token per frame in VLA world models maintains or improves long-horizon performance on MetaWorld, LIBERO, and real-robot tasks.

OA-WAM: Object-Addressable World Action Model for Robust Robot Manipulation

cs.RO · 2026-05-07 · unverdicted · novelty 7.0

OA-WAM uses persistent address vectors and dynamic content vectors in object slots to enable addressable world-action prediction, improving robustness on manipulation benchmarks under scene changes.

Being-H0.7: A Latent World-Action Model from Egocentric Videos

cs.RO · 2026-04-30 · unverdicted · novelty 7.0

Being-H0.7 adds future-aware latent reasoning to direct VLA policies via dual-branch alignment on latent queries, matching world-model benefits at VLA efficiency.

CF-VLA: Efficient Coarse-to-Fine Action Generation for Vision-Language-Action Policies

cs.CV · 2026-04-27 · unverdicted · novelty 7.0

CF-VLA uses a coarse initialization over endpoint velocity followed by single-step refinement to achieve strong performance with low inference steps on CALVIN, LIBERO, and real-robot tasks.

Modular Sensory Stream for Integrating Physical Feedback in Vision-Language-Action Models

cs.RO · 2026-04-25 · unverdicted · novelty 7.0

MoSS augments VLAs with decoupled modality streams for multiple physical signals, achieving synergistic gains in real-world robot tasks via joint attention and auxiliary future-signal prediction.

${\pi}_{0.7}$: a Steerable Generalist Robotic Foundation Model with Emergent Capabilities

cs.LG · 2026-04-16 · unverdicted · novelty 7.0

π₀.₇ is a steerable generalist robotic model that uses rich multimodal prompts including language, subgoal images, and performance metadata to achieve out-of-the-box generalization across tasks and robot bodies.

QuantVLA: Scale-Calibrated Post-Training Quantization for Vision-Language-Action Models

cs.LG · 2026-02-23 · unverdicted · novelty 7.0

QuantVLA is the first post-training quantization framework for VLA models that quantizes the diffusion transformer action head and reports higher task success rates than full-precision baselines with roughly 70% memory savings on the quantized components.

VLA-Corrector: Lightweight Detect-and-Correct Inference for Adaptive Action Horizon

cs.RO · 2026-07-02 · unverdicted · novelty 6.0

VLA-Corrector adds a detect-and-correct inference layer using a latent vision monitor and online gradient guidance to enable adaptive action horizons in chunked VLA policies.

UniTacVLA: Unified Tactile Understanding and Prediction in Vision Language Action Models

cs.RO · 2026-06-30 · unverdicted · novelty 6.0

UniTacVLA builds a state-aware and dynamics-aware tactile prior via unified latent space, tactile chain-of-thought, and mixed real/predicted feedback controller to boost dexterous manipulation performance.

Training Vision-Language-Action Models with Dense Embodied Chain-of-Thought Supervision

cs.RO · 2026-06-29 · unverdicted · novelty 6.0 · 4 refs

ZR-0 is a dual-stream VLA model trained with dense ECoT supervision on 60M frames from 400K trajectories to enable cross-embodiment transfer in simulation and real-world settings.

Critical Interval MSE: Toward Reliable Offline Validation for Robot Manipulation Policies

cs.RO · 2026-06-29 · unverdicted · novelty 6.0

CI-MSE improves Spearman's rank correlation between offline validation error and real rollout performance from -0.61 (raw MSE) to -0.87 across policy checkpoints in simulation and real-world robot manipulation experiments.

DIM-WAM: World-Action Modeling with Diverse Historical Event Memory

cs.RO · 2026-06-26 · unverdicted · novelty 6.0

DiM-WAM is a memory-augmented world-action model that integrates multi-scale historical events and global task progress to improve long-horizon robot manipulation performance.

citing papers explorer

Showing 49 of 99 citing papers.

ALAM: Algebraically Consistent Latent Action Model for Vision-Language-Action Models cs.RO · 2026-05-11 · unverdicted · none · ref 62 · 2 links · internal anchor
ALAM introduces algebraic consistency regularization on latent action transitions from videos, raising VLA success rates from 47.9% to 85.0% on MetaWorld MT50 and 94.1% to 98.1% on LIBERO.
BioProVLA-Agent: An Affordable, Protocol-Driven, Vision-Enhanced VLA-Enabled Embodied Multi-Agent System with Closed-Loop-Capable Reasoning for Biological Laboratory Manipulation cs.RO · 2026-05-08 · unverdicted · none · ref 8 · 2 links · internal anchor
Presents BioProVLA-Agent, a protocol-driven VLA-enabled multi-agent system for embodied biological manipulation with visual state verification and AugSmolVLA augmentation for robustness in wet-lab conditions.
Toward Visually Realistic Simulation: A Benchmark for Evaluating Robot Manipulation in Simulation cs.RO · 2026-05-07 · unverdicted · none · ref 49 · internal anchor
VISER is a new visually realistic simulation benchmark for robot manipulation tasks that uses PBR materials and MLLM-assisted asset generation, achieving 0.92 Pearson correlation with real-world policy performance.
SpecPL: Disentangling Spectral Granularity for Prompt Learning cs.CV · 2026-05-06 · unverdicted · none · ref 59 · internal anchor
SpecPL introduces spectral decomposition via frozen VAE and counterfactual high-frequency permutation to bridge modality asymmetry in VLM prompt learning, reaching 81.51% harmonic-mean accuracy on 11 benchmarks.
MolmoAct2: Action Reasoning Models for Real-world Deployment cs.RO · 2026-05-04 · unverdicted · none · ref 45 · 2 links · internal anchor
MolmoAct2 is an open VLA model that outperforms baselines like Pi-05 on 7 benchmarks and whose backbone surpasses GPT-5 on 13 embodied-reasoning tasks through new datasets, specialized training, and architecture changes for lower latency.
PRTS: A Primitive Reasoning and Tasking System via Contrastive Representations cs.AI · 2026-04-30 · unverdicted · none · ref 38 · internal anchor
PRTS pretrains VLA models with contrastive goal-conditioned RL to embed goal-reachability probabilities from offline data, yielding SOTA results on robotic benchmarks especially for long-horizon and novel instructions.
Unmasking the Illusion of Embodied Reasoning in Vision-Language-Action Models cs.RO · 2026-04-20 · unverdicted · none · ref 24 · internal anchor
State-of-the-art vision-language-action models catastrophically fail dynamic embodied reasoning due to lexical-kinematic shortcuts, behavioral inertia, and semantic feature collapse caused by architectural bottlenecks, as shown by the new BeTTER benchmark with real-world validation.
Grounded World Model for Semantically Generalizable Planning cs.RO · 2026-04-13 · conditional · none · ref 67 · internal anchor
A vision-language-aligned world model turns visuomotor MPC into a language-following planner that reaches 87% success on 288 unseen semantic tasks where standard VLAs drop to 22%.
AIM: Intent-Aware Unified world action Modeling with Spatial Value Maps cs.RO · 2026-04-13 · unverdicted · none · ref 25 · internal anchor
AIM predicts aligned spatial value maps inside a shared video-generation transformer to produce reliable robot actions, reaching 94% success on RoboTwin 2.0 with larger gains on long-horizon and contact-rich tasks.
VAG: Dual-Stream Video-Action Generation for Embodied Data Synthesis cs.RO · 2026-04-10 · unverdicted · none · ref 86 · internal anchor
VAG is a synchronized dual-stream flow-matching framework that generates aligned video-action pairs for synthetic embodied data synthesis and policy pretraining.
A1: A Fully Transparent Open-Source, Adaptive and Efficient Truncated Vision-Language-Action Model cs.RO · 2026-04-07 · unverdicted · none · ref 60 · internal anchor
A1 is a transparent VLA framework achieving state-of-the-art robot manipulation success with up to 72% lower latency via adaptive layer truncation and inter-layer flow matching.
Adaptive Action Chunking at Inference-time for Vision-Language-Action Models cs.RO · 2026-04-05 · unverdicted · none · ref 48 · internal anchor
Adaptive Action Chunking uses action entropy to dynamically adjust chunk sizes in VLA models, improving performance on simulated and real robotic manipulation tasks.
SABER: A Stealthy Agentic Black-Box Attack Framework for Vision-Language-Action Models cs.RO · 2026-03-26 · unverdicted · none · ref 39 · internal anchor
SABER uses a trained ReAct agent to produce bounded adversarial edits to robot instructions, cutting task success by 20.6% and increasing execution length and violations on the LIBERO benchmark across six VLA models.
Towards Long-Lived Robots: Continual Learning VLA Models via Reinforcement Fine-Tuning cs.RO · 2026-02-11 · unverdicted · none · ref 79 · internal anchor
LifeLong-RFT applies chunking-level on-policy reinforcement learning with Quantized Action Consistency Reward, Continuous Trajectory Alignment Reward, and Format Compliance Reward to fine-tune VLA models, achieving a 22% average success rate gain over supervised fine-tuning on the LIBERO benchmark's
AsyncVLA: Asynchronous Flow Matching for Vision-Language-Action Models cs.RO · 2025-11-18 · unverdicted · none · ref 80 · internal anchor
AsyncVLA adds asynchronous flow matching and a confidence rater to VLA models so they can generate actions on flexible schedules and selectively refine low-confidence tokens before execution.
Learning Action Priors for Cross-embodiment Robot Manipulation cs.RO · 2026-06-24 · unverdicted · none · ref 55 · internal anchor
A two-stage framework pretrains an action module with temporal motion priors from unconditioned trajectories using flow-matching, then transfers it to VLA training via decoder reuse and distillation, yielding better performance on cross-embodiment tasks.
World Value Models for Robotic Manipulation cs.RO · 2026-06-23 · unverdicted · none · ref 76 · internal anchor
World Value Model (WVM) integrates world models with value estimation to achieve SOTA Value-Order Correlation on expert and suboptimal robotic data and improves downstream policy performance.
TRACE: Trajectory-Routed Causal Memory for Delayed-Evidence Visuomotor Imitation cs.RO · 2026-06-12 · unverdicted · none · ref 15 · internal anchor
TRACE attaches a trajectory-signature-indexed latent memory to existing visuomotor imitation policies to recover context for delayed-evidence branch points in long-horizon robot tasks.
LabVLA: Grounding Vision-Language-Action Models in Scientific Laboratories cs.CL · 2026-06-11 · unverdicted · none · ref 41 · internal anchor
LabVLA uses RoboGenesis simulation data and a two-stage FAST pretraining plus flow matching recipe on a Qwen3-VL backbone to achieve the highest success rates on LabUtopia under in- and out-of-distribution conditions.
Real-Time Execution with Autoregressive Policies cs.RO · 2026-06-11 · unverdicted · none · ref 9 · internal anchor
Autoregressive VLA policies achieve real-time execution via tokenization horizon adjustment and constrained decoding, outperforming flow-matching policies in speed and performance across simulated and real environments.
HiMem-WAM: Hierarchical Memory-Gated World Action Models for Robotic Manipulation cs.RO · 2026-06-09 · unverdicted · none · ref 8 · internal anchor
HiMem-WAM integrates hierarchical latent actions and boundary-aware memory gates into world action models to enhance robustness and performance on memory-dependent long-horizon robotic tasks.
AHA-WAM:Asynchronous Horizon-Adaptive World-Action Modeling with Observation-Guided Context Routing cs.RO · 2026-06-08 · unverdicted · none · ref 8 · internal anchor
AHA-WAM is a dual-DiT asynchronous world-action model with horizon-adaptive offset training and OVCR routing that reports 92.8% success on RoboTwin and 78.3% on real tasks at 24.17 Hz without robot pretraining.
EinSort: Sorting is All We Need for Tensorizing LLM cs.LG · 2026-06-07 · unverdicted · none · ref 101 · internal anchor
Sorting tensor indices enables an adaptive tensorization method that discovers low-rank structure in LLM weights and KV caches, yielding better reconstruction quality than baselines.
Light-WAM: Efficient World Action Models with State-Fusion Action Decoding cs.CV · 2026-06-06 · unverdicted · none · ref 35 · internal anchor
Light-WAM is a lightweight world action model that performs latent-space video supervision and state-fusion action decoding to achieve usable multi-task robot performance with 0.44B parameters and low inference latency.
Dreaming when Necessary: Advancing World Action Models with Adaptive Multi-Modal Reasoning cs.RO · 2026-06-05 · unverdicted · none · ref 32 · internal anchor
AdaWAM introduces an adaptive router that triggers textual or visual reasoning as needed in world action models, claiming better efficiency and performance than prior embodied policies on simulated and real tasks.
World-Language-Action Model for Unified World Modeling, Language Reasoning, and Action Synthesis cs.RO · 2026-06-04 · unverdicted · none · ref 58 · internal anchor
WLA models use an autoregressive Transformer to jointly predict textual subtasks, subgoal images, and robot actions from instructions, images, and states, reporting SOTA success rates on RoboTwin2.0 and RMBench.
GeoSem-WAM: Geometry- and Semantic-Aware World Action Models cs.RO · 2026-06-02 · unverdicted · none · ref 41 · internal anchor
GeoSem-WAM adds geometric and semantic auxiliary prediction tasks to World Action Models during training to improve latent representations and action prediction accuracy while keeping inference efficient by avoiding explicit future rollouts.
DeMaVLA: A Vision-Language-Action Foundation Model for Generalizable Deformable Manipulation cs.RO · 2026-05-29 · unverdicted · none · ref 37 · internal anchor
DeMaVLA is a VLA foundation model using a pruned action expert and flow matching, pre-trained on 5000 hours of real demonstrations and post-trained on multi-task folding data with human-in-the-loop correction, reporting competitive benchmark and real-world folding performance.
Wall-OSS-0.5 Technical Report cs.RO · 2026-05-29 · unverdicted · none · ref 8 · internal anchor
Wall-OSS-0.5 is a 4B VLA model pretrained across many embodiments that achieves zero-shot real-robot performance on a 17-task suite and outperforms π_0.5 after fine-tuning.
VLA-Pro: Cross-Task Procedural Memory Transfer for Vision-Language-Action Models cs.RO · 2026-05-28 · unverdicted · none · ref 52 · internal anchor
VLA-Pro improves cross-task generalization in vision-language-action models by storing task-specific LoRA adapters as procedural memories and retrieving/fusing them at inference.
Rethinking VLM Representation for VLA Initialization cs.CV · 2026-05-25 · unverdicted · none · ref 37 · internal anchor
Experiments indicate original VLM representations are crucial for VLA performance, LoRA outperforms full finetuning, and staged robot-data pretraining yields the strongest initialization.
QuoVLA: Quotient Space for Vision-Language-Action Models cs.CV · 2026-05-24 · unverdicted · none · ref 35 · internal anchor
QuoVLA introduces a quotient-space framework that compresses VLM latents into action-sufficient representations via quantization and dual-branch design for better VLA generalization.
PointACT: Vision-Language-Action Models with Multi-Scale Point-Action Interaction cs.RO · 2026-05-20 · unverdicted · none · ref 73 · internal anchor
PointACT proposes a 3D-aware dual-system VLA policy using multi-scale point-action interaction with bottleneck window self-attention, achieving 10% higher success rates on RLBench-10Tasks over prior pretrained VLAs.
VLA-REPLICA: A Low-Cost, Reproducible Benchmark for Real-World Evaluation of Vision-Language-Action Models cs.RO · 2026-05-20 · conditional · none · ref 34 · internal anchor
VLA-REPLICA is a low-cost and reproducible real-world benchmark for evaluating VLA models in robotic manipulation tasks.
PhysBrain 1.0 Technical Report cs.RO · 2026-05-14 · unverdicted · none · ref 43 · internal anchor
PhysBrain 1.0 extracts scene elements, spatial dynamics, actions and depth relations from human egocentric video to create QA supervision for VLMs, then transfers the resulting physical priors to VLA policies via capability-preserving adaptation.
IntentVLA: Short-Horizon Intent Modeling for Aliased Robot Manipulation cs.RO · 2026-05-14 · unverdicted · none · ref 51 · internal anchor
IntentVLA conditions VLA chunk generation on a compact intent code from recent observations and introduces AliasBench to evaluate stability under short-horizon observation aliasing, reporting gains on multiple robot benchmarks.
Is the Future Compatible? Diagnosing Dynamic Consistency in World Action Models cs.RO · 2026-05-08 · unverdicted · none · ref 43 · internal anchor
Action-state consistency in World Action Models distinguishes successful from failed imagined futures and supports value-free selection of better rollouts via consensus among predictions.
VLA-GSE: Boosting Parameter-Efficient Fine-Tuning in VLA with Generalized and Specialized Experts cs.RO · 2026-05-07 · unverdicted · none · ref 34 · 2 links · internal anchor
VLA-GSE uses spectral decomposition of the VLA backbone to create generalized and specialized experts, enabling effective robot task adaptation while updating only 2.51% of parameters and achieving 81.2% zero-shot success on LIBERO-Plus.
STARRY: Spatial-Temporal Action-Centric World Modeling for Robotic Manipulation cs.RO · 2026-04-29 · unverdicted · none · ref 41 · internal anchor
STARRY uses unified diffusion to align spatial-temporal world predictions with action generation plus GASAM for geometry-aware attention, reaching 93.82%/93.30% success on 50 bimanual tasks in simulation and raising real-world success from 42.5% to 70.8%.
Goal2Skill: Long-Horizon Manipulation with Adaptive Planning and Reflection cs.RO · 2026-04-15 · unverdicted · none · ref 24 · internal anchor
A dual VLM-VLA framework for long-horizon robot manipulation achieves 32.4% success on RMBench tasks versus 9.8% for the strongest baseline via structured memory and closed-loop adaptive replanning.
vla-eval: A Unified Evaluation Harness for Vision-Language-Action Models cs.AI · 2026-03-14 · accept · none · ref 3 · internal anchor
vla-eval decouples VLA model inference from benchmark execution via WebSocket and Docker, supporting 14 benchmarks with up to 47x speedup and reproducing published scores across six codebases.
Causal World Modeling for Robot Control cs.CV · 2026-01-29 · unverdicted · none · ref 93 · internal anchor
LingBot-VA combines video world modeling with policy learning via Mixture-of-Transformers, closed-loop rollouts, and asynchronous inference to improve robot manipulation in simulation and real settings.
Motus: A Unified Latent Action World Model cs.CV · 2025-12-15 · unverdicted · none · ref 60 · internal anchor
Motus unifies understanding, video generation, and action in one latent world model via MoT experts and optical-flow latent actions, reporting gains over prior methods in simulation and real robots.
XR-1: Towards Versatile Vision-Language-Action Models via Learning Unified Vision-Motion Representations cs.RO · 2025-11-04 · unverdicted · none · ref 108 · internal anchor
XR-1 introduces Unified Vision-Motion Codes learned by dual-branch VQ-VAE and applies them in a three-stage training pipeline to outperform prior VLA models on 120+ real-world manipulation tasks across six robot embodiments.
RouterVLA: Turning Smoke Tests into Supervision for Heterogeneous VLA Selection cs.RO · 2026-06-25 · unverdicted · none · ref 33 · internal anchor
RouterVLA reports that a simple probe-success rule from outcome-separated smoke tests raises held-out VLA success by 14.64pp on 34,752 LIBERO-Plus records, with learned scorers adding no further gain.
Evo-Depth: A Lightweight Depth-Enhanced Vision-Language-Action Model cs.CV · 2026-05-14 · unverdicted · none · ref 58 · internal anchor
Evo-Depth is a compact VLA model using a lightweight implicit depth encoder from RGB views plus progressive alignment to boost manipulation performance without added hardware.
AttenA+: Rectifying Action Inequality in Robotic Foundation Models cs.RO · 2026-05-13 · unverdicted · none · ref 40 · 2 links · internal anchor
AttenA+ reweights action training objectives in VLA and WAM models via inverse velocity attention to prioritize kinematically critical segments, yielding small benchmark gains.
RLDX-1 Technical Report cs.RO · 2026-05-05 · unverdicted · none · ref 123 · 2 links · internal anchor
RLDX-1 outperforms frontier VLAs such as π0.5 and GR00T N1.6 on dexterous manipulation benchmarks, reaching 86.8% success on ALLEX humanoid tasks versus around 40% for the baselines.
Genie Sim 3.0 : A High-Fidelity Comprehensive Simulation Platform for Humanoid Robot cs.RO · 2026-01-05 · unreviewed · ref 51 · internal anchor

X-VLA: Soft-Prompted Transformer as Scalable Cross-Embodiment Vision-Language-Action Model

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer