PearlVLA achieves SOTA on LIBERO by separating VLM representations into visual grounding and an iterative latent plan branch refined via world model queries and RefineNet with process-reward RL.
hub Canonical reference
Hume: Introducing system- 2 thinking in visual-language-action model.arXiv preprint arXiv:2505.21432
Canonical reference. 100% of citing Pith papers cite this work as background.
hub tools
citation-role summary
citation-polarity summary
roles
background 5polarities
background 5representative citing papers
Q-VGM introduces value-gradient matching via VGG-Flow to improve flow-matching VLA policies with a Cal-QL critic, achieving success rate lifts on LIBERO, RoboTwin, and real-robot tasks.
MoLA infers a mixture of latent actions from generated future videos via modality-aware inverse dynamics models to improve robot manipulation policies.
HDP3 is a pocket-scale 3D diffusion policy with a Diffusion Mixer decoder that achieves state-of-the-art visuomotor control using two-step DDIM inference and under 1% of the parameters of prior 3D diffusion policies.
PaIR-Drive runs IL and RL in parallel branches with a tree-structured sampler to reach 91.2 PDMS and 87.9 EPDMS on NAVSIM benchmarks while outperforming sequential RL fine-tuning and correcting some human errors.
AR-VLA introduces a standalone autoregressive action expert with long-lived memory that generates context-aware continuous actions for VLAs, replacing chunk-based heads with smoother trajectories and maintained task success.
ROSA introduces shared GPU-pool serving, robotics-aware abstractions for multi-model pipelines, and factory-productivity scheduling that improves output by up to 12.06x over dedicated per-robot systems.
RSEA adds a strict held-out keep-better gate to recursive self-evolution of agent artifacts, yielding monotone-safe gains or parity with the base ReAct agent on ALFWorld, GAIA, τ-bench, and WebShop.
VLA-ATTC equips VLA models with adaptive test-time compute via an uncertainty clutch and relative action critic, cutting failure rates by over 50% on LIBERO-LONG.
Sentinel-VLA adds metacognitive status monitoring to VLA models for on-demand reasoning and error recovery, reporting over 30% higher real-world task success than prior SOTA.
Adaptive Action Chunking uses action entropy to dynamically adjust chunk sizes in VLA models, improving performance on simulated and real robotic manipulation tasks.
UAV-Track VLA modifies the π0.5 VLA architecture with temporal compression and dual-branch decoding to reach 61.76% success and 269.65 average frames in long-distance pedestrian tracking on a new 890K-frame UAV dataset, while cutting inference latency by 33.4%.
IDCL adds density-based curriculum learning and density-core guidance to deep image clustering, claiming superior robustness, faster convergence, and flexibility on benchmark datasets.
InternVLA-M1 uses spatially guided pre-training on 2.3M examples followed by action post-training to deliver up to 17% gains on robot manipulation benchmarks and 20.6% on unseen objects.
F1 integrates next-scale visual foresight prediction into a Mixture-of-Transformer VLA architecture to reformulate action generation as foresight-guided inverse dynamics, achieving higher success rates on 136 tasks.
DreamVLA uses dynamic-region-guided world knowledge prediction, block-wise attention to disentangle information types, and a diffusion transformer for actions, reaching 76.7% success on real robot tasks and 4.44 average length on CALVIN ABC-D.
FSDrive uses a generated future scene frame as visual spatio-temporal CoT to improve VLA models for safer autonomous driving trajectory prediction.
VLA benchmark success rates cannot distinguish semantic generalization from physical reasoning due to an identifiability gap in current evaluation protocols.
FORCE is a 3-stage RL fine-tuning method for VLA models that stabilizes Q-function via on-policy warm-up and filters high-value actions for updates, claiming 79% success rate gains and 32.5% faster training without human intervention.
VGAS uses best-of-N selection with a geometrically grounded critic and explicit regularization to improve success rates of few-shot VLA policies under limited data and distribution shifts.
citing papers explorer
-
UAV-Track VLA: Embodied Aerial Tracking via Vision-Language-Action Models
UAV-Track VLA modifies the π0.5 VLA architecture with temporal compression and dual-branch decoding to reach 61.76% success and 269.65 average frames in long-distance pedestrian tracking on a new 890K-frame UAV dataset, while cutting inference latency by 33.4%.
-
FutureSightDrive: Thinking Visually with Spatio-Temporal CoT for Autonomous Driving
FSDrive uses a generated future scene frame as visual spatio-temporal CoT to improve VLA models for safer autonomous driving trajectory prediction.
-
Position: Vision-Language-Action Models Cannot Be Verified to Perform Physical Reasoning
VLA benchmark success rates cannot distinguish semantic generalization from physical reasoning due to an identifiability gap in current evaluation protocols.