{"total":10,"items":[{"citing_arxiv_id":"2605.17486","ref_index":40,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"DyGRO-VLA: Cross-Task Scaling of Vision-Language-Action Models via Dynamic Grouped Residual Optimization","primary_cat":"cs.RO","submitted_at":"2026-05-17T14:55:32+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"DyGRO-VLA is a two-stage optimization framework for cross-task scaling of Vision-Language-Action models via dynamic grouped residual optimization in RL.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.15298","ref_index":45,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"PhysBrain 1.0 Technical Report","primary_cat":"cs.RO","submitted_at":"2026-05-14T18:11:47+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"PhysBrain 1.0 extracts scene elements, spatial dynamics, actions and depth relations from human egocentric video to create QA supervision for VLMs, then transfers the resulting physical priors to VLA policies via capability-preserving adaptation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.01194","ref_index":32,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"VLA-ATTC: Adaptive Test-Time Compute for VLA Models with Relative Action Critic Model","primary_cat":"cs.RO","submitted_at":"2026-05-02T02:13:11+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"VLA-ATTC equips VLA models with adaptive test-time compute via an uncertainty clutch and relative action critic, cutting failure rates by over 50% on 
LIBERO-LONG.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.01191","ref_index":21,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Sentinel-VLA: A Metacognitive VLA Model with Active Status Monitoring for Dynamic Reasoning and Error Recovery","primary_cat":"cs.RO","submitted_at":"2026-05-02T02:10:54+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.00321","ref_index":38,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Embodied Interpretability: Linking Causal Understanding to Generalization in Vision-Language-Action Models","primary_cat":"cs.RO","submitted_at":"2026-05-01T01:00:00+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.25459","ref_index":63,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"GS-Playground: A High-Throughput Photorealistic Simulator for Vision-Informed Robot Learning","primary_cat":"cs.RO","submitted_at":"2026-04-28T10:05:39+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"GS-Playground delivers a high-throughput photorealistic simulator for vision-informed robot learning via parallel physics integrated with batch 3D Gaussian Splatting at 10^4 FPS and an automated Real2Sim workflow for consistent environments.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"J Zheng, Z Xiong, Y Wang, M Zhang, P Ma, et al. 
Genesis: A generative and universal physics engine for robotics and beyond. arXiv preprint arXiv:2401.01454, 2024. [62] Zhongyi Zhou, Yichen Zhu, Junjie Wen, Chaomin Shen, and Yi Xu. Vision-language-action model with open-world embodied reasoning from pretrained knowledge. arXiv preprint arXiv:2505.21906, 2025. [63] Yuke Zhu, Josiah Wong, Ajay Mandlekar, Roberto Martín-Martín, Abhishek Joshi, Soroush Nasiriany, and Yifeng Zhu. robosuite: A modular simulation framework and benchmark for robot learning. arXiv preprint arXiv:2009.12293, 2020. [64] Ziwen Zhuang, Zipeng Fu, Jianren Wang, Christopher Atkeson, Soeren Schwertfeger, Chelsea Finn, and Hang Zhao. Robot parkour learning."},{"citing_arxiv_id":"2602.20231","ref_index":29,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"UniLACT: Depth-Aware RGB Latent Action Learning for Vision-Language-Action Models","primary_cat":"cs.RO","submitted_at":"2026-02-23T18:41:41+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"UniLACT improves VLA models by adding depth-aware unified latent action pretraining that outperforms RGB-only baselines on seen and unseen manipulation tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2508.13073","ref_index":131,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Large VLM-based Vision-Language-Action Models for Robotic Manipulation: A Survey","primary_cat":"cs.RO","submitted_at":"2025-08-18T16:45:48+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"This survey organizes large VLM-based VLA models for robotic manipulation into monolithic and hierarchical paradigms, reviews their integrations and datasets, and outlines future 
directions.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Fast-in-Slow [40] Prismatic Diff., AR Propose a unified dual-system model that embeds fast execution within a VLM-based reasoner. OpenHelix [58] LLaVA Diff. Conduct auxiliary training on the token bridging VLM and policy. ChatVLA [130] Qwen2-VL Diff. Unify vision-language-action via MoE-shared attention with separate perception/control FFNs. ChatVLA-2 [131] Qwen2-VL Diff. Enable open-world robotic reasoning via dynamic MoE routing and Reasoning-Following MLP. Diffusion-VLA [132] Qwen2-VL Diff. Merge Qwen2-VL reasoning with diffusion actions via FiLM-modulated reasoning injection. TriVLA [133] Eagle-2 Diff. Introduce a world-dynamics perception module as system 3 to complement static perception. GF-VLA [134] LLaMA 2 Regression Enable interpretable bimanual manipulation via information-theoretic graphs from human videos."},{"citing_arxiv_id":"2503.03480","ref_index":49,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"SafeVLA: Towards Safety Alignment of Vision-Language-Action Model via Constrained Learning","primary_cat":"cs.RO","submitted_at":"2025-03-05T13:16:55+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"SafeVLA applies constrained reinforcement learning via CMDP min-max optimization to VLAs, cutting safety violation costs by 83.58% while preserving task success on long-horizon mobile manipulation tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2412.14058","ref_index":54,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"What Matters in Building Vision-Language-Action Models for Generalist 
Robots","primary_cat":"cs.RO","submitted_at":"2024-12-18T17:07:20+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Systematic tests of VLM backbones, policy architectures, and cross-embodiment data yield RoboVLMs that set new SOTA on robot manipulation benchmarks while requiring few manual designs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}