Hume: Introducing System-2 Thinking in Visual-Language-Action Model. arXiv preprint arXiv:2505.21432

Song, H · 2025 · arXiv 2505.21432

Canonical reference. 100% of citing Pith papers cite this work as background.

22 Pith papers citing it

Background 100% of classified citations

citation-role summary

background 5

citation-polarity summary

background 5

representative citing papers

PearlVLA: Progressive Embodied Action-Plan Refinement in Latent Space

cs.RO · 2026-06-16 · unverdicted · novelty 7.0

PearlVLA achieves SOTA on LIBERO by separating VLM representations into visual grounding and an iterative latent plan branch refined via world model queries and RefineNet with process-reward RL.

Q-VGM: Q-Guided Value-Gradient Matching for Flow-Matching VLA Policies

cs.RO · 2026-06-06 · unverdicted · novelty 7.0

Q-VGM introduces value-gradient matching via VGG-Flow to improve flow-matching VLA policies with a Cal-QL critic, achieving success rate lifts on LIBERO, RoboTwin, and real-robot tasks.

From Imagined Futures to Executable Actions: Mixture of Latent Actions for Robot Manipulation

cs.RO · 2026-05-12 · unverdicted · novelty 7.0

MoLA infers a mixture of latent actions from generated future videos via modality-aware inverse dynamics models to improve robot manipulation policies.

Hyper-DP3: Frequency-Aware Right-Sizing of 3D Diffusion Policies for Visuomotor Control

cs.RO · 2026-05-02 · unverdicted · novelty 7.0 · 2 refs

HDP3 is a pocket-scale 3D diffusion policy with a Diffusion Mixer decoder that achieves state-of-the-art visuomotor control using two-step DDIM inference and under 1% of the parameters of prior 3D diffusion policies.

Fine-tuning is Not Enough: A Parallel Framework for Collaborative Imitation and Reinforcement Learning in End-to-end Autonomous Driving

cs.RO · 2026-03-14 · unverdicted · novelty 7.0

PaIR-Drive runs IL and RL in parallel branches with a tree-structured sampler to reach 91.2 PDMS and 87.9 EPDMS on NAVSIM benchmarks while outperforming sequential RL fine-tuning and correcting some human errors.

AR-VLA: True Autoregressive Action Expert for Vision-Language-Action Models

cs.RO · 2026-03-10 · unverdicted · novelty 7.0

AR-VLA introduces a standalone autoregressive action expert with long-lived memory that generates context-aware continuous actions for VLAs, replacing chunk-based heads with smoother trajectories and maintained task success.

ROSA: A Robotics Foundation Model Serving System for Robot Factories

cs.RO · 2026-07-01 · unverdicted · novelty 6.0

ROSA introduces shared GPU-pool serving, robotics-aware abstractions for multi-model pipelines, and factory-productivity scheduling that improves output by up to 12.06x over dedicated per-robot systems.

E-TTS: A New Embodied Test-Time Scaling Framework for Robotic Manipulation

cs.RO · 2026-06-25 · unverdicted · novelty 6.0

E-TTS introduces a plug-and-play test-time scaling method for embodied tasks that unifies reasoning-action sampling with history buffers and closed-loop refinement to improve performance on manipulation benchmarks.

UniFS: Unified Fast-to-Slow Hierarchical Architecture for Vision-Language-Action Models

cs.RO · 2026-06-22 · unverdicted · novelty 6.0

UniFS achieves 98.3% success on LIBERO with 2.1x lower latency than prior fast-slow VLA models by stratifying VLM layer update frequencies, inverting latent interactions, and applying multi-level supervision.

Recursive Self-Evolving Agents via Held-Out Selection

cs.AI · 2026-06-17 · unverdicted · novelty 6.0

RSEA adds a strict held-out keep-better gate to recursive self-evolution of agent artifacts, yielding monotone-safe gains or parity with the base ReAct agent on ALFWorld, GAIA, τ-bench, and WebShop.

VLA-ATTC: Adaptive Test-Time Compute for VLA Models with Relative Action Critic Model

cs.RO · 2026-05-02 · unverdicted · novelty 6.0

VLA-ATTC equips VLA models with adaptive test-time compute via an uncertainty clutch and relative action critic, cutting failure rates by over 50% on LIBERO-LONG.

Sentinel-VLA: A Metacognitive VLA Model with Active Status Monitoring for Dynamic Reasoning and Error Recovery

cs.RO · 2026-05-02 · unverdicted · novelty 6.0 · 2 refs

Sentinel-VLA adds metacognitive status monitoring to VLA models for on-demand reasoning and error recovery, reporting over 30% higher real-world task success than prior SOTA.

Adaptive Action Chunking at Inference-time for Vision-Language-Action Models

cs.RO · 2026-04-05 · unverdicted · novelty 6.0

Adaptive Action Chunking uses action entropy to dynamically adjust chunk sizes in VLA models, improving performance on simulated and real robotic manipulation tasks.

UAV-Track VLA: Embodied Aerial Tracking via Vision-Language-Action Models

cs.CV · 2026-04-02 · conditional · novelty 6.0

UAV-Track VLA modifies the π0.5 VLA architecture with temporal compression and dual-branch decoding to reach 61.76% success and 269.65 average frames in long-distance pedestrian tracking on a new 890K-frame UAV dataset, while cutting inference latency by 33.4%.

Deep Image Clustering Based on Curriculum Learning and Density Information

cs.CV · 2026-03-31 · unverdicted · novelty 6.0

IDCL adds density-based curriculum learning and density-core guidance to deep image clustering, claiming superior robustness, faster convergence, and flexibility on benchmark datasets.

InternVLA-M1: A Spatially Guided Vision-Language-Action Framework for Generalist Robot Policy

cs.RO · 2025-10-15 · unverdicted · novelty 6.0

InternVLA-M1 uses spatially guided pre-training on 2.3M examples followed by action post-training to deliver up to 17% gains on robot manipulation benchmarks and 20.6% on unseen objects.

F1: A Vision-Language-Action Model Bridging Understanding and Generation to Actions

cs.RO · 2025-09-08 · unverdicted · novelty 6.0

F1 integrates next-scale visual foresight prediction into a Mixture-of-Transformer VLA architecture to reformulate action generation as foresight-guided inverse dynamics, achieving higher success rates on 136 tasks.

DreamVLA: A Vision-Language-Action Model Dreamed with Comprehensive World Knowledge

cs.CV · 2025-07-06 · unverdicted · novelty 6.0

DreamVLA uses dynamic-region-guided world knowledge prediction, block-wise attention to disentangle information types, and a diffusion transformer for actions, reaching 76.7% success on real robot tasks and 4.44 average length on CALVIN ABC-D.

FutureSightDrive: Thinking Visually with Spatio-Temporal CoT for Autonomous Driving

cs.CV · 2025-05-23 · conditional · novelty 6.0

FSDrive uses a generated future scene frame as visual spatio-temporal CoT to improve VLA models for safer autonomous driving trajectory prediction.

Position: Vision-Language-Action Models Cannot Be Verified to Perform Physical Reasoning

cs.RO · 2026-06-28 · conditional · novelty 5.0

VLA benchmark success rates cannot distinguish semantic generalization from physical reasoning due to an identifiability gap in current evaluation protocols.

FORCE: Efficient VLA Reinforcement Fine-Tuning via Value-Calibrated Warm-up and Self-Distillation

cs.RO · 2026-06-24 · unverdicted · novelty 5.0

FORCE is a 3-stage RL fine-tuning method for VLA models that stabilizes Q-function via on-policy warm-up and filters high-value actions for updates, claiming 79% success rate gains and 32.5% faster training without human intervention.

VGAS: Value-Guided Action-Chunk Selection for Few-Shot Vision-Language-Action Adaptation

cs.AI · 2026-02-07 · unverdicted · novelty 5.0

VGAS uses best-of-N selection with a geometrically grounded critic and explicit regularization to improve success rates of few-shot VLA policies under limited data and distribution shifts.

citing papers explorer

Showing 3 of 3 citing papers after filters.

UAV-Track VLA: Embodied Aerial Tracking via Vision-Language-Action Models

cs.CV · 2026-04-02 · conditional · none · ref 31

UAV-Track VLA modifies the π0.5 VLA architecture with temporal compression and dual-branch decoding to reach 61.76% success and 269.65 average frames in long-distance pedestrian tracking on a new 890K-frame UAV dataset, while cutting inference latency by 33.4%.

FutureSightDrive: Thinking Visually with Spatio-Temporal CoT for Autonomous Driving

cs.CV · 2025-05-23 · conditional · none · ref 55

FSDrive uses a generated future scene frame as visual spatio-temporal CoT to improve VLA models for safer autonomous driving trajectory prediction.

Position: Vision-Language-Action Models Cannot Be Verified to Perform Physical Reasoning

cs.RO · 2026-06-28 · conditional · none · ref 58

VLA benchmark success rates cannot distinguish semantic generalization from physical reasoning due to an identifiability gap in current evaluation protocols.
