hub

SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics

Mustafa Shukor, Dana Aubakirova, Francesco Capuano, Pepijn Kooijmans, Steven Palma, Adil Zouitine · 2025 · cs.LG · arXiv 2506.01844

48 Pith papers cite this work. Polarity classification is still indexing.

48 Pith papers citing it

open full Pith review browse 48 citing papers arXiv PDF

abstract

Vision-language models (VLMs) pretrained on large-scale multimodal datasets encode rich visual and linguistic knowledge, making them a strong foundation for robotics. Rather than training robotic policies from scratch, recent approaches adapt VLMs into vision-language-action (VLA) models that enable natural language-driven perception and control. However, existing VLAs are typically massive--often with billions of parameters--leading to high training costs and limited real-world deployability. Moreover, they rely on academic and industrial datasets, overlooking the growing availability of community-collected data from affordable robotic platforms. In this work, we present SmolVLA, a small, efficient, and community-driven VLA that drastically reduces both training and inference costs, while retaining competitive performance. SmolVLA is designed to be trained on a single GPU and deployed on consumer-grade GPUs or even CPUs. To further improve responsiveness, we introduce an asynchronous inference stack decoupling perception and action prediction from action execution, allowing higher control rates with chunked action generation. Despite its compact size, SmolVLA achieves performance comparable to VLAs that are 10x larger. We evaluate SmolVLA on a range of both simulated as well as real-world robotic benchmarks and release all code, pretrained models, and training data.

hub tools

JSON dossier citing papers JSON arXiv source

claims ledger

abstract Vision-language models (VLMs) pretrained on large-scale multimodal datasets encode rich visual and linguistic knowledge, making them a strong foundation for robotics. Rather than training robotic policies from scratch, recent approaches adapt VLMs into vision-language-action (VLA) models that enable natural language-driven perception and control. However, existing VLAs are typically massive--often with billions of parameters--leading to high training costs and limited real-world deployability. Moreover, they rely on academic and industrial datasets, overlooking the growing availability of comm

co-cited works

representative citing papers

Realtime-VLA FLASH: Speculative Inference Framework for Diffusion-based VLAs

cs.RO · 2026-05-13 · unverdicted · novelty 7.0

A new speculative inference system speeds up diffusion VLAs to 19.1 ms average latency (3.04x faster) on LIBERO by replacing most full 58 ms inferences with 7.8 ms draft rounds while preserving task performance.

Test-time Sparsity for Extreme Fast Action Diffusion

cs.CV · 2026-05-13 · unverdicted · novelty 7.0

Test-time sparsity with a parallel pipeline and omnidirectional feature reuse accelerates action diffusion by 5x to 47.5 Hz while cutting FLOPs 92% with no performance loss.

From Imagined Futures to Executable Actions: Mixture of Latent Actions for Robot Manipulation

cs.RO · 2026-05-12 · unverdicted · novelty 7.0

MoLA infers a mixture of latent actions from generated future videos via modality-aware inverse dynamics models to improve robot manipulation policies.

Premover: Fast Vision-Language-Action Control by Acting Before Instructions Are Complete

cs.RO · 2026-05-12 · unverdicted · novelty 7.0

Premover enables VLA policies to act on partial instructions by precomputing focus maps from intermediate backbone layers, reducing wall-clock time 13.6 percent on LIBERO while preserving 95 percent success rate.

Do Vision-Language-Models show human-like logical problem-solving capability in point and click puzzle games?

cs.AI · 2026-05-11 · unverdicted · novelty 7.0

VLATIM benchmark reveals large VLMs excel at high-level planning in physics puzzles but struggle with precise visual grounding and mouse control, so they lack human-like problem-solving capabilities.

LoopVLA: Learning Sufficiency in Recurrent Refinement for Vision-Language-Action Models

cs.AI · 2026-05-11 · unverdicted · novelty 7.0

LoopVLA adds recurrent refinement and learned sufficiency estimation to VLA models, cutting parameters 45% and raising throughput 1.7x while matching baseline task success on LIBERO and VLA-Arena.

Towards Backdoor-Based Ownership Verification for Vision-Language-Action Models

cs.RO · 2026-05-09 · unverdicted · novelty 7.0

GuardVLA embeds a stealthy backdoor watermark in VLAs via secret messages in visual data and uses a swap-and-detect mechanism for post-release ownership verification that preserves task performance.

NoiseGate: Learning Per-Latent Timestep Schedules as Information Gating in World Action Models

cs.RO · 2026-05-08 · unverdicted · novelty 7.0

NoiseGate learns per-latent timestep schedules as an information-gating policy in diffusion-based world action models, yielding consistent gains on RoboTwin manipulation tasks.

OA-WAM: Object-Addressable World Action Model for Robust Robot Manipulation

cs.RO · 2026-05-07 · unverdicted · novelty 7.0

OA-WAM uses persistent address vectors and dynamic content vectors in object slots to enable addressable world-action prediction, improving robustness on manipulation benchmarks under scene changes.

Characterizing Vision-Language-Action Models across XPUs: Constraints and Acceleration for On-Robot Deployment

cs.RO · 2026-04-27 · unverdicted · novelty 7.0

VLA models exhibit a compute-bound VLM phase followed by a memory-bound action phase on edge hardware; DP-Cache and V-AEFusion reduce redundancy and enable pipeline parallelism for up to 6x speedup on NPUs with marginal task degradation.

See What Matters: Differentiable Grid Sample Pruning for Generalizable Vision-Language-Action Model

cs.RO · 2026-05-12 · unverdicted · novelty 6.0

GridS reduces visual tokens in VLA models to under 10% of the original count via task-aware differentiable resampling, delivering 76% lower FLOPs with no drop in task success rate on benchmarks and real robots.

Overcoming Dynamics-Blindness: Training-Free Pace-and-Path Correction for VLA Models

cs.RO · 2026-05-12 · unverdicted · novelty 6.0

Pace-and-Path Correction is a closed-form inference-time operator that decomposes a quadratic cost minimization into orthogonal pace compression and path offset channels to correct dynamics-blindness in chunked-action VLA models.

PriorVLA: Prior-Preserving Adaptation for Vision-Language-Action Models

cs.RO · 2026-05-11 · unverdicted · novelty 6.0

PriorVLA preserves pretrained priors in VLA models through a frozen Prior Expert and trained Adaptation Expert, delivering better robot manipulation performance than full fine-tuning with only 25% of the parameter updates.

Unified Noise Steering for Efficient Human-Guided VLA Adaptation

cs.RO · 2026-05-11 · unverdicted · novelty 6.0

UniSteer unifies human corrective actions and noise-space RL for VLA adaptation by inverting actions to noise targets, raising success rates from 20% to 90% in 66 minutes across four real-world manipulation tasks.

ALAM: Algebraically Consistent Latent Action Model for Vision-Language-Action Models

cs.RO · 2026-05-11 · unverdicted · novelty 6.0 · 2 refs

ALAM introduces algebraic consistency regularization on latent action transitions from videos, raising VLA success rates from 47.9% to 85.0% on MetaWorld MT50 and 94.1% to 98.1% on LIBERO.

One Token Per Frame: Reconsidering Visual Bandwidth in World Models for VLA Policy

cs.CV · 2026-05-08 · unverdicted · novelty 6.0 · 2 refs

Reducing visual input to one token per frame via adaptive attention pooling and a unified flow-matching objective improves long-horizon performance in VLA policies on MetaWorld, LIBERO, and real-robot tasks.

Escaping the Diversity Trap in Robotic Manipulation via Anchor-Centric Adaptation

cs.RO · 2026-05-08 · unverdicted · novelty 6.0

Anchor-Centric Adaptation escapes the diversity trap by prioritizing repeated demonstrations at core anchors over broad coverage, yielding higher success rates under fixed data budgets in robotic manipulation.

When to Trust Imagination: Adaptive Action Execution for World Action Models

cs.RO · 2026-05-07 · unverdicted · novelty 6.0 · 2 refs

A verifier called Future Forward Dynamics Causal Attention enables adaptive action execution in World Action Models, reducing model inferences by 69% and improving success rates in robotic tasks.

ConsisVLA-4D: Advancing Spatiotemporal Consistency in Efficient 3D-Perception and 4D-Reasoning for Robotic Manipulation

cs.RO · 2026-05-06 · unverdicted · novelty 6.0

ConsisVLA-4D adds cross-view semantic alignment, cross-object geometric fusion, and cross-scene dynamic reasoning to VLA models, delivering 21.6% and 41.5% gains plus 2.3x and 2.4x speedups on LIBERO and real-world tasks.

MolmoAct2: Action Reasoning Models for Real-world Deployment

cs.RO · 2026-05-04 · unverdicted · novelty 6.0 · 2 refs

MolmoAct2 is an open VLA model that outperforms baselines like Pi-05 on 7 benchmarks and whose backbone surpasses GPT-5 on 13 embodied-reasoning tasks through new datasets, specialized training, and architecture changes for lower latency.

VLA-ATTC: Adaptive Test-Time Compute for VLA Models with Relative Action Critic Model

cs.RO · 2026-05-02 · unverdicted · novelty 6.0

VLA-ATTC equips VLA models with adaptive test-time compute via an uncertainty clutch and relative action critic, cutting failure rates by over 50% on LIBERO-LONG.

MSACT: Multistage Spatial Alignment for Stable Low-Latency Fine Manipulation

cs.RO · 2026-05-01 · unverdicted · novelty 6.0

MSACT improves localization stability and task success rates in limited-data bimanual manipulation by extracting stable 2D attention points and aligning predicted attention sequences across frames without keypoint labels.

Stereo Multistage Spatial Attention for Real-Time Mobile Manipulation Under Visual Scale Variation and Disturbances

cs.RO · 2026-05-01 · unverdicted · novelty 6.0

A stereo multistage spatial attention deep predictive learning system improves robustness and success rates for real-time mobile manipulation under visual scale variation and disturbances.

$M^2$-VLA: Boosting Vision-Language Models for Generalizable Manipulation via Layer Mixture and Meta-Skills

cs.RO · 2026-04-27 · unverdicted · novelty 6.0

M²-VLA shows that generalized VLMs can serve as direct backbones for robotic manipulation by selectively extracting task-critical features via Mixture of Layers and adding Meta Skill Modules for efficient trajectory learning.

citing papers explorer

Showing 48 of 48 citing papers.

Realtime-VLA FLASH: Speculative Inference Framework for Diffusion-based VLAs cs.RO · 2026-05-13 · unverdicted · none · ref 22 · internal anchor
A new speculative inference system speeds up diffusion VLAs to 19.1 ms average latency (3.04x faster) on LIBERO by replacing most full 58 ms inferences with 7.8 ms draft rounds while preserving task performance.
Test-time Sparsity for Extreme Fast Action Diffusion cs.CV · 2026-05-13 · unverdicted · none · ref 24 · internal anchor
Test-time sparsity with a parallel pipeline and omnidirectional feature reuse accelerates action diffusion by 5x to 47.5 Hz while cutting FLOPs 92% with no performance loss.
From Imagined Futures to Executable Actions: Mixture of Latent Actions for Robot Manipulation cs.RO · 2026-05-12 · unverdicted · none · ref 42 · internal anchor
MoLA infers a mixture of latent actions from generated future videos via modality-aware inverse dynamics models to improve robot manipulation policies.
Premover: Fast Vision-Language-Action Control by Acting Before Instructions Are Complete cs.RO · 2026-05-12 · unverdicted · none · ref 18 · internal anchor
Premover enables VLA policies to act on partial instructions by precomputing focus maps from intermediate backbone layers, reducing wall-clock time 13.6 percent on LIBERO while preserving 95 percent success rate.
Do Vision-Language-Models show human-like logical problem-solving capability in point and click puzzle games? cs.AI · 2026-05-11 · unverdicted · none · ref 8 · internal anchor
VLATIM benchmark reveals large VLMs excel at high-level planning in physics puzzles but struggle with precise visual grounding and mouse control, so they lack human-like problem-solving capabilities.
LoopVLA: Learning Sufficiency in Recurrent Refinement for Vision-Language-Action Models cs.AI · 2026-05-11 · unverdicted · none · ref 12 · internal anchor
LoopVLA adds recurrent refinement and learned sufficiency estimation to VLA models, cutting parameters 45% and raising throughput 1.7x while matching baseline task success on LIBERO and VLA-Arena.
Towards Backdoor-Based Ownership Verification for Vision-Language-Action Models cs.RO · 2026-05-09 · unverdicted · none · ref 14 · internal anchor
GuardVLA embeds a stealthy backdoor watermark in VLAs via secret messages in visual data and uses a swap-and-detect mechanism for post-release ownership verification that preserves task performance.
NoiseGate: Learning Per-Latent Timestep Schedules as Information Gating in World Action Models cs.RO · 2026-05-08 · unverdicted · none · ref 13 · internal anchor
NoiseGate learns per-latent timestep schedules as an information-gating policy in diffusion-based world action models, yielding consistent gains on RoboTwin manipulation tasks.
OA-WAM: Object-Addressable World Action Model for Robust Robot Manipulation cs.RO · 2026-05-07 · unverdicted · none · ref 69 · internal anchor
OA-WAM uses persistent address vectors and dynamic content vectors in object slots to enable addressable world-action prediction, improving robustness on manipulation benchmarks under scene changes.
Characterizing Vision-Language-Action Models across XPUs: Constraints and Acceleration for On-Robot Deployment cs.RO · 2026-04-27 · unverdicted · none · ref 20 · internal anchor
VLA models exhibit a compute-bound VLM phase followed by a memory-bound action phase on edge hardware; DP-Cache and V-AEFusion reduce redundancy and enable pipeline parallelism for up to 6x speedup on NPUs with marginal task degradation.
See What Matters: Differentiable Grid Sample Pruning for Generalizable Vision-Language-Action Model cs.RO · 2026-05-12 · unverdicted · none · ref 20 · internal anchor
GridS reduces visual tokens in VLA models to under 10% of the original count via task-aware differentiable resampling, delivering 76% lower FLOPs with no drop in task success rate on benchmarks and real robots.
Overcoming Dynamics-Blindness: Training-Free Pace-and-Path Correction for VLA Models cs.RO · 2026-05-12 · unverdicted · none · ref 35 · internal anchor
Pace-and-Path Correction is a closed-form inference-time operator that decomposes a quadratic cost minimization into orthogonal pace compression and path offset channels to correct dynamics-blindness in chunked-action VLA models.
PriorVLA: Prior-Preserving Adaptation for Vision-Language-Action Models cs.RO · 2026-05-11 · unverdicted · none · ref 37 · internal anchor
PriorVLA preserves pretrained priors in VLA models through a frozen Prior Expert and trained Adaptation Expert, delivering better robot manipulation performance than full fine-tuning with only 25% of the parameter updates.
Unified Noise Steering for Efficient Human-Guided VLA Adaptation cs.RO · 2026-05-11 · unverdicted · none · ref 12 · internal anchor
UniSteer unifies human corrective actions and noise-space RL for VLA adaptation by inverting actions to noise targets, raising success rates from 20% to 90% in 66 minutes across four real-world manipulation tasks.
ALAM: Algebraically Consistent Latent Action Model for Vision-Language-Action Models cs.RO · 2026-05-11 · unverdicted · none · ref 40 · 2 links · internal anchor
ALAM introduces algebraic consistency regularization on latent action transitions from videos, raising VLA success rates from 47.9% to 85.0% on MetaWorld MT50 and 94.1% to 98.1% on LIBERO.
One Token Per Frame: Reconsidering Visual Bandwidth in World Models for VLA Policy cs.CV · 2026-05-08 · unverdicted · none · ref 37 · 2 links · internal anchor
Reducing visual input to one token per frame via adaptive attention pooling and a unified flow-matching objective improves long-horizon performance in VLA policies on MetaWorld, LIBERO, and real-robot tasks.
Escaping the Diversity Trap in Robotic Manipulation via Anchor-Centric Adaptation cs.RO · 2026-05-08 · unverdicted · none · ref 19 · internal anchor
Anchor-Centric Adaptation escapes the diversity trap by prioritizing repeated demonstrations at core anchors over broad coverage, yielding higher success rates under fixed data budgets in robotic manipulation.
When to Trust Imagination: Adaptive Action Execution for World Action Models cs.RO · 2026-05-07 · unverdicted · none · ref 21 · 2 links · internal anchor
A verifier called Future Forward Dynamics Causal Attention enables adaptive action execution in World Action Models, reducing model inferences by 69% and improving success rates in robotic tasks.
ConsisVLA-4D: Advancing Spatiotemporal Consistency in Efficient 3D-Perception and 4D-Reasoning for Robotic Manipulation cs.RO · 2026-05-06 · unverdicted · none · ref 65 · internal anchor
ConsisVLA-4D adds cross-view semantic alignment, cross-object geometric fusion, and cross-scene dynamic reasoning to VLA models, delivering 21.6% and 41.5% gains plus 2.3x and 2.4x speedups on LIBERO and real-world tasks.
MolmoAct2: Action Reasoning Models for Real-world Deployment cs.RO · 2026-05-04 · unverdicted · none · ref 32 · 2 links · internal anchor
MolmoAct2 is an open VLA model that outperforms baselines like Pi-05 on 7 benchmarks and whose backbone surpasses GPT-5 on 13 embodied-reasoning tasks through new datasets, specialized training, and architecture changes for lower latency.
VLA-ATTC: Adaptive Test-Time Compute for VLA Models with Relative Action Critic Model cs.RO · 2026-05-02 · unverdicted · none · ref 19 · internal anchor
VLA-ATTC equips VLA models with adaptive test-time compute via an uncertainty clutch and relative action critic, cutting failure rates by over 50% on LIBERO-LONG.
MSACT: Multistage Spatial Alignment for Stable Low-Latency Fine Manipulation cs.RO · 2026-05-01 · unverdicted · none · ref 4 · internal anchor
MSACT improves localization stability and task success rates in limited-data bimanual manipulation by extracting stable 2D attention points and aligning predicted attention sequences across frames without keypoint labels.
Stereo Multistage Spatial Attention for Real-Time Mobile Manipulation Under Visual Scale Variation and Disturbances cs.RO · 2026-05-01 · unverdicted · none · ref 18 · internal anchor
A stereo multistage spatial attention deep predictive learning system improves robustness and success rates for real-time mobile manipulation under visual scale variation and disturbances.
$M^2$-VLA: Boosting Vision-Language Models for Generalizable Manipulation via Layer Mixture and Meta-Skills cs.RO · 2026-04-27 · unverdicted · none · ref 39 · internal anchor
M²-VLA shows that generalized VLMs can serve as direct backbones for robotic manipulation by selectively extracting task-critical features via Mixture of Layers and adding Meta Skill Modules for efficient trajectory learning.
AsyncShield: A Plug-and-Play Edge Adapter for Asynchronous Cloud-based VLA Navigation cs.RO · 2026-04-27 · unverdicted · none · ref 23 · internal anchor
AsyncShield restores VLA geometric intent from latency via kinematic pose mapping and uses PPO-Lagrangian to balance tracking with LiDAR safety constraints in a plug-and-play module.
LeHome: A Simulation Environment for Deformable Object Manipulation in Household Scenarios cs.RO · 2026-04-24 · unverdicted · none · ref 26 · internal anchor
LeHome is a simulation platform offering high-fidelity dynamics for robotic manipulation of varied deformable objects in household settings, with support for multiple robot embodiments including low-cost hardware.
CorridorVLA: Explicit Spatial Constraints for Generative Action Heads via Sparse Anchors cs.RO · 2026-04-23 · unverdicted · none · ref 16 · internal anchor
CorridorVLA improves VLA models by using predicted sparse anchors to impose explicit spatial corridors on action trajectories, yielding 3.4-12.4% success rate gains on LIBERO-Plus with GR00T-Corr reaching 83.21%.
LongBench: Evaluating Robotic Manipulation Policies on Real-World Long-Horizon Tasks cs.RO · 2026-04-18 · unverdicted · none · ref 13 · internal anchor
LongBench is a new real-world benchmark that separates execution robustness from context-dependent reasoning in long-horizon robotic manipulation and shows these are distinct challenges not uniformly solved by memory-based policies.
From Seeing to Simulating: Generative High-Fidelity Simulation with Digital Cousins for Generalizable Robot Learning and Evaluation cs.RO · 2026-04-17 · unverdicted · none · ref 46 · internal anchor
Digital Cousins is a generative real-to-sim method that creates diverse high-fidelity simulation scenes from real panoramas to improve generalization in robot learning and evaluation.
Long-Term Memory for VLA-based Agents in Open-World Task Execution cs.RO · 2026-04-17 · unverdicted · none · ref 9 · internal anchor
ChemBot adds dual-layer memory and future-state asynchronous inference to VLA models, enabling better long-horizon success in chemical lab automation on collaborative robots.
Grounded World Model for Semantically Generalizable Planning cs.RO · 2026-04-13 · conditional · none · ref 50 · internal anchor
A vision-language-aligned world model turns visuomotor MPC into a language-following planner that reaches 87% success on 288 unseen semantic tasks where standard VLAs drop to 22%.
Device-Conditioned Neural Architecture Search for Efficient Robotic Manipulation cs.RO · 2026-04-11 · unverdicted · none · ref 27 · internal anchor
DC-QFA trains one supernet over architectures and bit-widths, then runs a fast per-device search plus multi-step distillation to deliver 2-3x faster robotic policies across hardware with negligible success-rate drop.
VAG: Dual-Stream Video-Action Generation for Embodied Data Synthesis cs.RO · 2026-04-10 · unverdicted · none · ref 55 · internal anchor
VAG is a synchronized dual-stream flow-matching framework that generates aligned video-action pairs for synthetic embodied data synthesis and policy pretraining.
A1: A Fully Transparent Open-Source, Adaptive and Efficient Truncated Vision-Language-Action Model cs.RO · 2026-04-07 · unverdicted · none · ref 32 · internal anchor
A1 is a transparent VLA framework achieving state-of-the-art robot manipulation success with up to 72% lower latency via adaptive layer truncation and inter-layer flow matching.
SnapFlow: One-Step Action Generation for Flow-Matching VLAs via Progressive Self-Distillation cs.CV · 2026-04-07 · unverdicted · none · ref 13 · internal anchor
SnapFlow compresses multi-step denoising in flow-matching VLAs into one step via progressive self-distillation using two-step Euler shortcuts from marginal velocities, matching 10-step teacher success rates with 9.6x speedup on pi0.5.
Grounding Hierarchical Vision-Language-Action Models Through Explicit Language-Action Alignment cs.RO · 2026-04-07 · unverdicted · none · ref 46 · internal anchor
A contrastive alignment model plus offline preference learning explicitly grounds hierarchical VLA language descriptions to actions and visuals on LanguageTable, achieving performance comparable to fully supervised fine-tuning while reducing annotation needs.
E-VLA: Event-Augmented Vision-Language-Action Model for Dark and Blurred Scenes cs.CV · 2026-04-06 · conditional · none · ref 50 · internal anchor
E-VLA integrates event streams directly into VLA models via lightweight fusion, raising Pick-Place success from 0% to 60-90% at 20 lux and from 0% to 20-25% under severe motion blur.
Adaptive Action Chunking at Inference-time for Vision-Language-Action Models cs.RO · 2026-04-05 · unverdicted · none · ref 33 · internal anchor
Adaptive Action Chunking uses action entropy to dynamically adjust chunk sizes in VLA models, improving performance on simulated and real robotic manipulation tasks.
Fast-WAM: Do World Action Models Need Test-time Future Imagination? cs.CV · 2026-03-17 · unverdicted · none · ref 18 · internal anchor
Fast-WAM shows that explicit future imagination at test time is not required for strong WAM performance; video modeling during training provides the main benefit.
InternVLA-M1: A Spatially Guided Vision-Language-Action Framework for Generalist Robot Policy cs.RO · 2025-10-15 · unverdicted · none · ref 35 · internal anchor
InternVLA-M1 uses spatially guided pre-training on 2.3M examples followed by action post-training to deliver up to 17% gains on robot manipulation benchmarks and 20.6% on unseen objects.
AttenA+: Rectifying Action Inequality in Robotic Foundation Models cs.RO · 2026-05-13 · unverdicted · none · ref 21 · internal anchor
AttenA+ applies velocity-driven action attention to reweight training objectives toward kinematically critical low-velocity segments, yielding small benchmark gains on Libero and RoboTwin without added parameters.
Nautilus: From One Prompt to Plug-and-Play Robot Learning cs.RO · 2026-05-12 · unverdicted · none · ref 93 · internal anchor
NAUTILUS is a prompt-driven harness that automates plug-and-play adapters, typed contracts, and validation for policies, benchmarks, and robots in learning research.
BioProVLA-Agent: An Affordable, Protocol-Driven, Vision-Enhanced VLA-Enabled Embodied Multi-Agent System with Closed-Loop-Capable Reasoning for Biological Laboratory Manipulation cs.RO · 2026-05-08 · unverdicted · none · ref 9 · internal anchor
BioProVLA-Agent integrates protocol parsing, visual state verification, and VLA-based execution in a closed-loop multi-agent framework with AugSmolVLA augmentation to improve robustness for biological lab tasks like tube handling and liquid pouring.
Understanding Asynchronous Inference Methods for Vision-Language-Action Models cs.RO · 2026-05-04 · unverdicted · none · ref 5 · internal anchor
Controlled benchmarks show per-step residual correction (A2C2) as most effective for VLA asynchronous inference up to d=8 delays on Kinetix with over 90% solve rate, outperforming inpainting and conditioning while training-based simulation is most robust.
PokeVLA: Empowering Pocket-Sized Vision-Language-Action Model with Comprehensive World Knowledge Guidance cs.RO · 2026-04-22 · unverdicted · none · ref 29 · internal anchor
PokeVLA is a lightweight VLA model pre-trained on 2.4M samples for spatial grounding and reasoning, then adapted via multi-view semantics and geometry alignment to achieve state-of-the-art robot manipulation performance.
VLA Foundry: A Unified Framework for Training Vision-Language-Action Models cs.RO · 2026-04-21 · unverdicted · none · ref 62 · internal anchor
VLA Foundry provides a single training stack for VLA models and releases open models that match prior closed-source performance or outperform baselines on multi-task manipulation in simulation.
Causal World Modeling for Robot Control cs.CV · 2026-01-29 · unverdicted · none · ref 67 · internal anchor
LingBot-VA combines video world modeling with policy learning via Mixture-of-Transformers, closed-loop rollouts, and asynchronous inference to improve robot manipulation in simulation and real settings.
SEVO: Semantic-Enhanced Virtual Observation for Robust VLA Manipulation via Active Illumination and Data-Centric Collection cs.RO · 2026-05-11 · conditional · none · ref 3 · internal anchor
SEVO raises ACT and SmolVLA pick-and-place success from 30-35% to 75-85% in novel environments by using active illumination, semantic cues, and diversified teleoperation data.

SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics

hub tools

claims ledger

co-cited works

fields

years

verdicts

representative citing papers

citing papers explorer