{"total":13,"items":[{"citing_arxiv_id":"2605.31116","ref_index":38,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"NTR: Neural Token Reconstruction for Scene Token Bottleneck in End-to-End Driving","primary_cat":"cs.CV","submitted_at":"2026-05-29T10:27:32+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"NTR adds a self-distillation masked latent reconstruction objective that uses only scene tokens to reconstruct masked patch features, improving visual representation quality and planning performance in end-to-end autonomous driving.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.28548","ref_index":64,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"GEM: Generative Supervision Helps Embodied Intelligence","primary_cat":"cs.CV","submitted_at":"2026-05-27T14:39:42+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"GEM adds generative depth supervision to VLM pre-training and reports improved results on embodied benchmarks plus real-world robot execution.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.22671","ref_index":23,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"From Abstraction to Instantiation: Learning Behavioral Representation for Vision-Language-Action Model","primary_cat":"cs.CV","submitted_at":"2026-05-21T16:14:19+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12160","ref_index":19,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Premover: Fast Vision-Language-Action Control by Acting 
Before Instructions Are Complete","primary_cat":"cs.RO","submitted_at":"2026-05-12T14:10:54+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Premover enables VLA policies to act on partial instructions by precomputing focus maps from intermediate backbone layers, reducing wall-clock time 13.6 percent on LIBERO while preserving 95 percent success rate.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"salient but instruction-irrelevant regions. Several VLA-specific efforts have addressed this gap. Some modify the VLA itself: Knowledge Insulation [6] preserves the VLM backbone's pretrained vision-language knowledge by blocking action-expert gradients during training; RoboGround [7] uses a separately fine-tuned grounded VLM to produce target masks for the policy; and ReconVLA [19] adds a gaze-region reconstruction objective. Others operate on frozen VLAs with auxiliary modules: VAP [10] equips a frozen VLA with selective attention using open-vocabulary detection over reference images, while PVI [23] injects auxiliary visual representations into a frozen action expert through residual pathways. 
These methods are designed for finalized instructions."},{"citing_arxiv_id":"2605.10903","ref_index":14,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"CapVector: Learning Transferable Capability Vectors in Parametric Space for Vision-Language-Action Models","primary_cat":"cs.CV","submitted_at":"2026-05-11T17:41:54+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Capability vectors extracted from parameter differences between standard and auxiliary-finetuned VLA models can be merged into pretrained weights to match auxiliary-training performance while reducing computational overhead during adaptation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.24182","ref_index":41,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"$M^2$-VLA: Boosting Vision-Language Models for Generalizable Manipulation via Layer Mixture and Meta-Skills","primary_cat":"cs.RO","submitted_at":"2026-04-27T08:44:12+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"M²-VLA shows that generalized VLMs can serve as direct backbones for robotic manipulation by selectively extracting task-critical features via Mixture of Layers and adding Meta Skill Modules for efficient trajectory learning.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"AgileX PiPer 6-DoF robotic manipulator equipped with a parallel gripper, resulting in a 7-DoF system (illustrated in Fig. 5). Visual observations are acquired via 2 RGB cameras: a static front-view camera (main camera) and a wrist-mounted camera (wrist camera). Evaluation Tasks. 
To assess real-world performance and generalization capabilities, inspired by recent works [41], [42], we design four categories of evaluation tasks: (1) pick the apple and place it in the basket, (2) pour water into the bowl, (3) pick with instruction following, and (4) pick novel objects. (1) and (2) are fundamental tasks, while (3) and (4) are generalization tasks. Specifically, the (3) pick with instruction following task evaluates language generalization"},{"citing_arxiv_id":"2604.21241","ref_index":15,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"CorridorVLA: Explicit Spatial Constraints for Generative Action Heads via Sparse Anchors","primary_cat":"cs.RO","submitted_at":"2026-04-23T03:17:50+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"CorridorVLA improves VLA models by using predicted sparse anchors to impose explicit spatial corridors on action trajectories, yielding 3.4-12.4% success rate gains on LIBERO-Plus with GR00T-Corr reaching 83.21%.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Among candidate intermediates, spatial cues are particularly prominent. A broad line of work seeks to represent \"what should change\" in the scene-often through future-oriented or change-focused modeling-and use it to support action generation. For instance, CoTVLA [13] and DreamVLA [14] highlight the utility of emphasizing regions of change, and ReconVLA [15] explores predicting future observations to inform long-horizon behavior. These approaches encode spatial guidance in visual or latent forms and inject it through representation learning. 
Motivated by the same goal of leveraging spatial structure, we explore a complementary route: can spatial guidance be expressed as direct, text-style physical quantities that align more closely"},{"citing_arxiv_id":"2604.19683","ref_index":32,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Mask World Model: Predicting What Matters for Robust Robot Policy Learning","primary_cat":"cs.RO","submitted_at":"2026-04-21T17:05:37+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Mask World Model predicts semantic mask dynamics with video diffusion and integrates it with a diffusion policy head, outperforming RGB world models on LIBERO and RLBench while showing better real-world generalization and texture robustness.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.14125","ref_index":33,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"HiVLA: A Visual-Grounded-Centric Hierarchical Embodied Manipulation System","primary_cat":"cs.CV","submitted_at":"2026-04-15T17:50:07+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"HiVLA decouples VLM-based semantic planning with visual grounding from a cascaded cross-attention DiT action expert, outperforming end-to-end VLAs on long-horizon and fine-grained manipulation.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"A critical challenge in manipulation is precise visual grounding, which accurately maps high-level instructions to specific spatial regions within the visual input. Early visual-centric VLAs, such as π0.5 [17] and InternVLA-M1 [11], address this by leveraging strong vision-language alignment for spatial localization. To further enforce visual attention, recent works explore integrated grounding techniques. 
ReconVLA [33] introduces an implicit paradigm that forces a diffusion transformer to reconstruct target gaze regions from visual outputs. Similarly, approaches like InterleaveVLA [14] and 3D-CAVLA [3] attempt to improve scene awareness by interleaving visual tokens with language or incorporating chain-of-thought region detection. However, these integrated methods lack explicit ar-"},{"citing_arxiv_id":"2604.11751","ref_index":51,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Grounded World Model for Semantically Generalizable Planning","primary_cat":"cs.RO","submitted_at":"2026-04-13T17:25:41+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A vision-language-aligned world model turns visuomotor MPC into a language-following planner that reaches 87% success on 288 unseen semantic tasks where standard VLAs drop to 22%.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Subsequently, the trajectory yielding the minimum cost is executed in the environment. To capture sufficient dynamic and semantic details, modern world models are usually trained with videos featuring realistic physics. During training, the current and future states are represented in either pixel space [7, 24, 61] or latent space [1, 63, 20]. Latent world models, such as DINO-WM [63] and JEPA-WM [51], have shown great potential for visuomotor planning, as they circumvent computationally expensive pixel reconstruction. For latent world models where state transition is defined in the latent space, the score function used for MPC is usually Mean Squared Error (MSE) between the embedding of each predicted future and that of the goal image. 
However,"},{"citing_arxiv_id":"2603.15620","ref_index":49,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Towards Generalizable Robotic Manipulation in Dynamic Environments","primary_cat":"cs.CV","submitted_at":"2026-03-16T17:59:57+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"DOMINO dataset and PUMA architecture enable better dynamic robotic manipulation by incorporating motion history, delivering 6.3% higher success rates than prior VLA models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2602.20200","ref_index":44,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Global Prior Meets Local Consistency: Dual-Memory Augmented Vision-Language-Action Model for Efficient Robotic Manipulation","primary_cat":"cs.RO","submitted_at":"2026-02-22T15:39:34+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"OptimusVLA augments hierarchical VLA models with Global Prior Memory for shorter generative paths and Local Consistency Memory for temporal coherence, yielding higher success rates and 2.9x faster inference on simulation and real-world robotic benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2511.18960","ref_index":40,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"AVA-VLA: Improving Vision-Language-Action models with Active Visual Attention","primary_cat":"cs.LG","submitted_at":"2025-11-24T10:22:28+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"AVA-VLA reformulates VLA learning as a POMDP using recurrent states and active visual attention to achieve state-of-the-art results on LIBERO, CALVIN, and real dual-arm 
tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}