hub Canonical reference

Reconvla: Reconstructive vision-language-action model as effective robot perceiver.arXiv preprint arXiv:2508.10333

Wenxuan Song, Ziyang Zhou, Han Zhao, Jiayi Chen, Pengxiang Ding, Haodong Yan, Yuxin Huang, Feilong Tang, Donglin Wang, Haoang Li · 2025 · arXiv 2508.10333

Canonical reference. 100% of citing Pith papers cite this work as background.

13 Pith papers citing it

Background 100% of classified citations

read on arXiv browse 13 citing papers

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 5

citation-polarity summary

background 5

representative citing papers

Premover: Fast Vision-Language-Action Control by Acting Before Instructions Are Complete

cs.RO · 2026-05-12 · unverdicted · novelty 7.0

Premover enables VLA policies to act on partial instructions by precomputing focus maps from intermediate backbone layers, reducing wall-clock time 13.6 percent on LIBERO while preserving 95 percent success rate.

CapVector: Learning Transferable Capability Vectors in Parametric Space for Vision-Language-Action Models

cs.CV · 2026-05-11 · unverdicted · novelty 7.0

Capability vectors extracted from parameter differences between standard and auxiliary-finetuned VLA models can be merged into pretrained weights to match auxiliary-training performance while reducing computational overhead during adaptation.

Mask World Model: Predicting What Matters for Robust Robot Policy Learning

cs.RO · 2026-04-21 · unverdicted · novelty 7.0

Mask World Model predicts semantic mask dynamics with video diffusion and integrates it with a diffusion policy head, outperforming RGB world models on LIBERO and RLBench while showing better real-world generalization and texture robustness.

Towards Generalizable Robotic Manipulation in Dynamic Environments

cs.CV · 2026-03-16 · unverdicted · novelty 7.0

DOMINO dataset and PUMA architecture enable better dynamic robotic manipulation by incorporating motion history, delivering 6.3% higher success rates than prior VLA models.

NTR: Neural Token Reconstruction for Scene Token Bottleneck in End-to-End Driving

cs.CV · 2026-05-29 · unverdicted · novelty 6.0

NTR adds a self-distillation masked latent reconstruction objective that uses only scene tokens to reconstruct masked patch features, improving visual representation quality and planning performance in end-to-end autonomous driving.

$M^2$-VLA: Boosting Vision-Language Models for Generalizable Manipulation via Layer Mixture and Meta-Skills

cs.RO · 2026-04-27 · unverdicted · novelty 6.0

M²-VLA shows that generalized VLMs can serve as direct backbones for robotic manipulation by selectively extracting task-critical features via Mixture of Layers and adding Meta Skill Modules for efficient trajectory learning.

CorridorVLA: Explicit Spatial Constraints for Generative Action Heads via Sparse Anchors

cs.RO · 2026-04-23 · unverdicted · novelty 6.0

CorridorVLA improves VLA models by using predicted sparse anchors to impose explicit spatial corridors on action trajectories, yielding 3.4-12.4% success rate gains on LIBERO-Plus with GR00T-Corr reaching 83.21%.

Grounded World Model for Semantically Generalizable Planning

cs.RO · 2026-04-13 · conditional · novelty 6.0

A vision-language-aligned world model turns visuomotor MPC into a language-following planner that reaches 87% success on 288 unseen semantic tasks where standard VLAs drop to 22%.

Global Prior Meets Local Consistency: Dual-Memory Augmented Vision-Language-Action Model for Efficient Robotic Manipulation

cs.RO · 2026-02-22 · unverdicted · novelty 6.0

OptimusVLA augments hierarchical VLA models with Global Prior Memory for shorter generative paths and Local Consistency Memory for temporal coherence, yielding higher success rates and 2.9x faster inference on simulation and real-world robotic benchmarks.

GEM: Generative Supervision Helps Embodied Intelligence

cs.CV · 2026-05-27 · unverdicted · novelty 5.0

GEM adds generative depth supervision to VLM pre-training and reports improved results on embodied benchmarks plus real-world robot execution.

HiVLA: A Visual-Grounded-Centric Hierarchical Embodied Manipulation System

cs.CV · 2026-04-15 · unverdicted · novelty 5.0 · 2 refs

HiVLA decouples VLM-based semantic planning with visual grounding from a cascaded cross-attention DiT action expert, outperforming end-to-end VLAs on long-horizon and fine-grained manipulation.

AVA-VLA: Improving Vision-Language-Action models with Active Visual Attention

cs.LG · 2025-11-24 · unverdicted · novelty 5.0

AVA-VLA reformulates VLA learning as a POMDP using recurrent states and active visual attention to achieve state-of-the-art results on LIBERO, CALVIN, and real dual-arm tasks.

From Abstraction to Instantiation: Learning Behavioral Representation for Vision-Language-Action Model

cs.CV · 2026-05-21

citing papers explorer

Showing 13 of 13 citing papers.

Premover: Fast Vision-Language-Action Control by Acting Before Instructions Are Complete cs.RO · 2026-05-12 · unverdicted · none · ref 19
Premover enables VLA policies to act on partial instructions by precomputing focus maps from intermediate backbone layers, reducing wall-clock time 13.6 percent on LIBERO while preserving 95 percent success rate.
CapVector: Learning Transferable Capability Vectors in Parametric Space for Vision-Language-Action Models cs.CV · 2026-05-11 · unverdicted · none · ref 14
Capability vectors extracted from parameter differences between standard and auxiliary-finetuned VLA models can be merged into pretrained weights to match auxiliary-training performance while reducing computational overhead during adaptation.
Mask World Model: Predicting What Matters for Robust Robot Policy Learning cs.RO · 2026-04-21 · unverdicted · none · ref 32
Mask World Model predicts semantic mask dynamics with video diffusion and integrates it with a diffusion policy head, outperforming RGB world models on LIBERO and RLBench while showing better real-world generalization and texture robustness.
Towards Generalizable Robotic Manipulation in Dynamic Environments cs.CV · 2026-03-16 · unverdicted · none · ref 49
DOMINO dataset and PUMA architecture enable better dynamic robotic manipulation by incorporating motion history, delivering 6.3% higher success rates than prior VLA models.
NTR: Neural Token Reconstruction for Scene Token Bottleneck in End-to-End Driving cs.CV · 2026-05-29 · unverdicted · none · ref 38
NTR adds a self-distillation masked latent reconstruction objective that uses only scene tokens to reconstruct masked patch features, improving visual representation quality and planning performance in end-to-end autonomous driving.
$M^2$-VLA: Boosting Vision-Language Models for Generalizable Manipulation via Layer Mixture and Meta-Skills cs.RO · 2026-04-27 · unverdicted · none · ref 41
M²-VLA shows that generalized VLMs can serve as direct backbones for robotic manipulation by selectively extracting task-critical features via Mixture of Layers and adding Meta Skill Modules for efficient trajectory learning.
CorridorVLA: Explicit Spatial Constraints for Generative Action Heads via Sparse Anchors cs.RO · 2026-04-23 · unverdicted · none · ref 15
CorridorVLA improves VLA models by using predicted sparse anchors to impose explicit spatial corridors on action trajectories, yielding 3.4-12.4% success rate gains on LIBERO-Plus with GR00T-Corr reaching 83.21%.
Grounded World Model for Semantically Generalizable Planning cs.RO · 2026-04-13 · conditional · none · ref 51
A vision-language-aligned world model turns visuomotor MPC into a language-following planner that reaches 87% success on 288 unseen semantic tasks where standard VLAs drop to 22%.
Global Prior Meets Local Consistency: Dual-Memory Augmented Vision-Language-Action Model for Efficient Robotic Manipulation cs.RO · 2026-02-22 · unverdicted · none · ref 44
OptimusVLA augments hierarchical VLA models with Global Prior Memory for shorter generative paths and Local Consistency Memory for temporal coherence, yielding higher success rates and 2.9x faster inference on simulation and real-world robotic benchmarks.
GEM: Generative Supervision Helps Embodied Intelligence cs.CV · 2026-05-27 · unverdicted · none · ref 64
GEM adds generative depth supervision to VLM pre-training and reports improved results on embodied benchmarks plus real-world robot execution.
HiVLA: A Visual-Grounded-Centric Hierarchical Embodied Manipulation System cs.CV · 2026-04-15 · unverdicted · none · ref 33 · 2 links
HiVLA decouples VLM-based semantic planning with visual grounding from a cascaded cross-attention DiT action expert, outperforming end-to-end VLAs on long-horizon and fine-grained manipulation.
AVA-VLA: Improving Vision-Language-Action models with Active Visual Attention cs.LG · 2025-11-24 · unverdicted · none · ref 40
AVA-VLA reformulates VLA learning as a POMDP using recurrent states and active visual attention to achieve state-of-the-art results on LIBERO, CALVIN, and real dual-arm tasks.
From Abstraction to Instantiation: Learning Behavioral Representation for Vision-Language-Action Model cs.CV · 2026-05-21 · unreviewed · ref 23

Reconvla: Reconstructive vision-language-action model as effective robot perceiver.arXiv preprint arXiv:2508.10333

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer