{"total":15,"items":[{"citing_arxiv_id":"2606.26006","ref_index":12,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"FORCE: Efficient VLA Reinforcement Fine-Tuning via Value-Calibrated Warm-up and Self-Distillation","primary_cat":"cs.RO","submitted_at":"2026-06-24T16:23:18+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"FORCE is a 3-stage RL fine-tuning method for VLA models that stabilizes Q-function via on-policy warm-up and filters high-value actions for updates, claiming 79% success rate gains and 32.5% faster training without human intervention.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12167","ref_index":43,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"From Imagined Futures to Executable Actions: Mixture of Latent Actions for Robot Manipulation","primary_cat":"cs.RO","submitted_at":"2026-05-12T14:15:16+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"MoLA infers a mixture of latent actions from generated future videos via modality-aware inverse dynamics models to improve robot manipulation policies.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.01581","ref_index":29,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Hyper-DP3: Frequency-Aware Right-Sizing of 3D Diffusion Policies for Visuomotor Control","primary_cat":"cs.RO","submitted_at":"2026-05-02T19:07:09+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"[27] Mohit Shridhar, Lucas Manuelli, and Dieter Fox. 
Perceiver-actor: A multi-task transformer for robotic manipulation. In Conference on Robot Learning, pages 785-799. PMLR, 2023. [28] Kihyuk Sohn, Honglak Lee, and Xinchen Yan. Learning structured output representation using deep conditional generative models. In Advances in Neural Information Processing Systems, volume 28, 2015. [29] Haoming Song, Delin Qu, Yuanqi Yao, Qizhi Chen, Qi Lv, Yiwen Tang, Modi Shi, Guanghui Ren, Maoqing Yao, Bin Zhao, et al. Hume: Introducing system-2 thinking in visual-language-action model. arXiv preprint arXiv:2505.21432, 2025. [30] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502, 2020."},{"citing_arxiv_id":"2605.01194","ref_index":20,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"VLA-ATTC: Adaptive Test-Time Compute for VLA Models with Relative Action Critic Model","primary_cat":"cs.RO","submitted_at":"2026-05-02T02:13:11+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"VLA-ATTC equips VLA models with adaptive test-time compute via an uncertainty clutch and relative action critic, cutting failure rates by over 50% on LIBERO-LONG.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.01191","ref_index":13,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Sentinel-VLA: A Metacognitive VLA Model with Active Status Monitoring for Dynamic Reasoning and Error Recovery","primary_cat":"cs.RO","submitted_at":"2026-05-02T02:10:54+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.04161","ref_index":36,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Adaptive Action Chunking at 
Inference-time for Vision-Language-Action Models","primary_cat":"cs.RO","submitted_at":"2026-04-05T16:03:32+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Adaptive Action Chunking uses action entropy to dynamically adjust chunk sizes in VLA models, improving performance on simulated and real robotic manipulation tasks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"mesh Garg, and Valts Blukis. OG-VLA: 3D-aware vision language action model via orthographic image generation. arXiv preprint arXiv:2506.01196, 2025. 1 [35] Junhyuk So, Chiwoong Lee, Shinyoung Lee, et al. Improving generative behavior cloning via self-guidance and adaptive chunking. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. 1 [36] Haoming Song, Delin Qu, Yuanqi Yao, Qizhi Chen, Qi Lv, Yiwen Tang, Modi Shi, Guanghui Ren, Maoqing Yao, et al. Hume: Introducing system-2 thinking in visual-language-action model. arXiv preprint arXiv:2505.21432, 2025. 2 [37] Wenxuan Song, Jiayi Chen, Pengxiang Ding, Han Zhao, Wei Zhao, Zhide Zhong, Zongyuan Ge, Jun Ma, and Haoang Li. 
Accelerating vision-language-action model integrated"},{"citing_arxiv_id":"2604.02241","ref_index":31,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"UAV-Track VLA: Embodied Aerial Tracking via Vision-Language-Action Models","primary_cat":"cs.CV","submitted_at":"2026-04-02T16:33:38+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"UAV-Track VLA modifies the π0.5 VLA architecture with temporal compression and dual-branch decoding to reach 61.76% success and 269.65 average frames in long-distance pedestrian tracking on a new 890K-frame UAV dataset, while cutting inference latency by 33.4%.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"knowledge to robotic control, 2023. [30] Octo Model Team, Dibya Ghosh, Homer Walke, Karl Pertsch, Kevin Black, Oier Mees, Sudeep Dasari, Joey Hejna, Tobias Kreiman, Charles Xu, Jianlan Luo, You Liang Tan, Lawrence Yunliang Chen, Pannag Sanketi, Quan Vuong, Ted Xiao, Dorsa Sadigh, Chelsea Finn, and Sergey Levine. Octo: An open-source generalist robot policy, 2024. [31] Haoming Song, Delin Qu, Yuanqi Yao, Qizhi Chen, Qi Lv, Yiwen Tang, Modi Shi, Guanghui Ren, Maoqing Yao, Bin Zhao, et al. Hume: Introducing system-2 thinking in visual-language-action model. arXiv preprint arXiv:2505.21432, 2025. [32] ByungOk Han, Jaehong Kim, and Jinhyeok Jang. 
A dual process vla: Efficient robotic manipulation leveraging vlm, 2024."},{"citing_arxiv_id":"2604.03306","ref_index":74,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Deep Image Clustering Based on Curriculum Learning and Density Information","primary_cat":"cs.CV","submitted_at":"2026-03-31T02:54:21+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"IDCL adds density-based curriculum learning and density-core guidance to deep image clustering, claiming superior robustness, faster convergence, and flexibility on benchmark datasets.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"https://arxiv.org/abs/1708.07747 [73] Junyuan Xie, Ross Girshick, and Ali Farhadi. 2016. Unsupervised Deep Embedding for Clustering Analysis. In Proceedings of The 33rd International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol.48), Maria Florina Balcan and Kilian Q. Weinberger (Eds.). PMLR, New York, New York, USA, 478-487. [74] Haoran Xu, Jiacong Hu, Ke Zhang, Lei Yu, Yuxin Tang, Xinyuan Song, Yiqun Duan, Lynn Ai, and Bill Shi. 2025. Sedm: Scalable self-evolving distributed memory for agents. arXiv preprint arXiv:2509.09498 (2025). [75] Teng Yan, Yuxiang Sun, Yang Zhang, Zhenxi Yu, Wenxian Li, and Kailiang Zhang. 2023. 
Stability analysis of 3c electronic industry robot grasping based on visual-"},{"citing_arxiv_id":"2603.13842","ref_index":46,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Fine-tuning is Not Enough: A Parallel Framework for Collaborative Imitation and Reinforcement Learning in End-to-end Autonomous Driving","primary_cat":"cs.RO","submitted_at":"2026-03-14T08:53:47+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"PaIR-Drive runs IL and RL in parallel branches with a tree-structured sampler to reach 91.2 PDMS and 87.9 EPDMS on NAVSIM benchmarks while outperforming sequential RL fine-tuning and correcting some human errors.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2603.10126","ref_index":37,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"AR-VLA: True Autoregressive Action Expert for Vision-Language-Action Models","primary_cat":"cs.RO","submitted_at":"2026-03-10T18:03:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"AR-VLA introduces a standalone autoregressive action expert with long-lived memory that generates context-aware continuous actions for VLAs, replacing chunk-based heads with smoother trajectories and maintained task success.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2602.07399","ref_index":38,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"VGAS: Value-Guided Action-Chunk Selection for Few-Shot Vision-Language-Action Adaptation","primary_cat":"cs.AI","submitted_at":"2026-02-07T06:31:53+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"VGAS uses best-of-N selection with a geometrically grounded critic and explicit 
regularization to improve success rates of few-shot VLA policies under limited data and distribution shifts.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2510.13778","ref_index":36,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"InternVLA-M1: A Spatially Guided Vision-Language-Action Framework for Generalist Robot Policy","primary_cat":"cs.RO","submitted_at":"2025-10-15T17:30:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"InternVLA-M1 uses spatially guided pre-training on 2.3M examples followed by action post-training to deliver up to 17% gains on robot manipulation benchmarks and 20.6% on unseen objects.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2509.06951","ref_index":24,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"F1: A Vision-Language-Action Model Bridging Understanding and Generation to Actions","primary_cat":"cs.RO","submitted_at":"2025-09-08T17:58:30+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"F1 integrates next-scale visual foresight prediction into a Mixture-of-Transformer VLA architecture to reformulate action generation as foresight-guided inverse dynamics, achieving higher success rates on 136 tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2507.04447","ref_index":89,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"DreamVLA: A Vision-Language-Action Model Dreamed with Comprehensive World Knowledge","primary_cat":"cs.CV","submitted_at":"2025-07-06T16:14:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DreamVLA uses 
dynamic-region-guided world knowledge prediction, block-wise attention to disentangle information types, and a diffusion transformer for actions, reaching 76.7% success on real robot tasks and 4.44 average length on CALVIN ABC-D.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"language-action model with open-world generalization. arXiv preprint arXiv:2504.16054, 2025. 3 [88] Qingwen Bu, Jisong Cai, Li Chen, Xiuqi Cui, Yan Ding, Siyuan Feng, Shenyuan Gao, Xindong He, Xu Huang, Shu Jiang, et al. Agibot world colosseo: A large-scale manipulation platform for scalable and intelligent embodied systems. arXiv preprint arXiv:2503.06669, 2025. 3 [89] Haoming Song, Delin Qu, Yuanqi Yao, Qizhi Chen, Qi Lv, Yiwen Tang, Modi Shi, Guanghui Ren, Maoqing Yao, Bin Zhao, et al. Hume: Introducing system-2 thinking in visual-language-action model. arXiv preprint arXiv:2505.21432, 2025. 3 [90] Cheng Chi, Zhenjia Xu, Siyuan Feng, Eric Cousineau, Yilun Du, Benjamin Burchfiel, Russ Tedrake, and Shuran Song."},{"citing_arxiv_id":"2505.17685","ref_index":55,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"FutureSightDrive: Thinking Visually with Spatio-Temporal CoT for Autonomous Driving","primary_cat":"cs.CV","submitted_at":"2025-05-23T09:55:32+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"FSDrive uses a generated future scene frame as visual spatio-temporal CoT to improve VLA models for safer autonomous driving trajectory prediction.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}