{"total":31,"items":[{"citing_arxiv_id":"2606.17937","ref_index":34,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"ThinkingVLA: Interleaved Vision and Language Reasoning for Robotic Manipulation","primary_cat":"cs.RO","submitted_at":"2026-06-16T13:45:17+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"ThinkingVLA is a Mixture-of-Transformers VLA model that performs interleaved forward CoT for subgoal and image prediction followed by inverse CoT grounded on the predicted image to generate actions.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.07895","ref_index":29,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"TBD-VLA: Temporal Block Diffusion Vision Language Action Model","primary_cat":"cs.CV","submitted_at":"2026-06-05T23:10:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"TBD-VLA partitions action sequences into temporal blocks, performs masked discrete diffusion within blocks, and autoregressive generation across blocks to unify parallel decoding with temporal coherence in discrete VLA models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.07170","ref_index":47,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Test-Time Trajectory Optimization for Autonomous Driving","primary_cat":"cs.RO","submitted_at":"2026-06-05T11:39:27+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"TOAD applies test-time Cross-Entropy Method optimization to refine trajectories using the planner's scorer as a reward function, improving end-to-end autonomous driving performance without 
retraining.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.06245","ref_index":15,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"MPCoT: Reward-Guided Multi-Path Latent Reasoning for Test-Time Scalable Vision-Language-Action","primary_cat":"cs.RO","submitted_at":"2026-06-04T14:48:44+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MPCoT improves long-horizon VLA performance on LIBERO and CALVIN by initializing M latent hypotheses, refining them over K steps, and aggregating via a reward-trained path scorer while preserving the original 8-step action interface and generating zero reasoning tokens.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.05979","ref_index":47,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"World-Language-Action Model for Unified World Modeling, Language Reasoning, and Action Synthesis","primary_cat":"cs.RO","submitted_at":"2026-06-04T10:23:01+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"WLA models use an autoregressive Transformer to jointly predict textual subtasks, subgoal images, and robot actions from instructions, images, and states, reporting SOTA success rates on RoboTwin2.0 and RMBench.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.03943","ref_index":56,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"PointAction: 3D Points as Universal Action Representations for Robot Control","primary_cat":"cs.RO","submitted_at":"2026-06-02T17:30:50+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"PointAction uses predicted dynamic 3D 
pointmaps from fine-tuned video models as an embodiment-agnostic action representation to map video predictions to executable robot actions.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.02735","ref_index":48,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"See Less, Specify More: Visual Evidence Budgets for Generalizable VLAs","primary_cat":"cs.RO","submitted_at":"2026-06-01T18:02:07+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"S2 improves generalization in vision-language-action models by using goal-preserving refined language guidance and explicit visual evidence budgets, raising mean subtask success from 54.2% to 79.0% on eight real-robot tasks compared to pi0.5.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.01241","ref_index":43,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"OneVLA: A Unified Framework for Embodied Tasks","primary_cat":"cs.RO","submitted_at":"2026-05-31T13:43:23+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"OneVLA is a unified VLA model using a shared action head and multi-stage progressive training with CoT fine-tuning that reports state-of-the-art results on both navigation and manipulation in simulation and real-world settings.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.28548","ref_index":70,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"GEM: Generative Supervision Helps Embodied Intelligence","primary_cat":"cs.CV","submitted_at":"2026-05-27T14:39:42+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"GEM adds generative 
depth supervision to VLM pre-training and reports improved results on embodied benchmarks plus real-world robot execution.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.00110","ref_index":131,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"General Covariant Action Modeling: Constructing Generalized Manifolds via Spatio-Temporal Decoupling","primary_cat":"cs.CV","submitted_at":"2026-05-27T03:38:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"GAM framework uses arc-length parameterization for temporal invariance and schema-affine factorization for geometric invariance to build a covariant action manifold integrated into VLA models for improved generalization from sparse data.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.25829","ref_index":51,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"OASIS: Observation-Action Space Alignment via SE(3) Trajectory Prediction for Robotic Manipulation","primary_cat":"cs.RO","submitted_at":"2026-05-25T13:28:33+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"OASIS improves robotic manipulation success and generalization by predicting camera-frame SE(3) end-effector trajectories to condition the action decoder on pose-supervised states.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.23856","ref_index":21,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Point Tracking Improves World Action Models","primary_cat":"cs.RO","submitted_at":"2026-05-22T17:08:37+00:00","verdict":"UNVERDICTED","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"JOPAT jointly 
models pixels, point tracks, and actions in a diffusion transformer and reports gains over pixel-only baselines on long-horizon robot tasks with occlusion and off-screen motion.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.23270","ref_index":17,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"ChainFlow-VLA: Causal Flow Planning with Vision-Language Models","primary_cat":"cs.CV","submitted_at":"2026-05-22T06:17:35+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ChainFlow-VLA unifies autoregressive causal trajectory modes with VLM-conditioned diffusion refinement to reach 94.85 on NAVSIM v1, matching human performance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.19678","ref_index":26,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"RoVLA: Multi-Consistency Constraints for Robust Vision-Language-Action Models","primary_cat":"cs.RO","submitted_at":"2026-05-19T11:10:20+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"RoVLA enforces instructional, evolutionary, and observational consistency to improve robustness of VLA policies on manipulation benchmarks and real robots.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.15120","ref_index":33,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"CLOVER: Closed-Loop Value Estimation and Ranking for End-to-End Autonomous Driving Planning","primary_cat":"cs.RO","submitted_at":"2026-05-14T17:32:18+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"CLOVER is a closed-loop generator-scorer framework that expands proposal 
coverage with pseudo-expert trajectories and performs conservative self-distillation to achieve state-of-the-art planning scores on NAVSIM and nuScenes.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.14696","ref_index":75,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"EponaV2: Driving World Model with Comprehensive Future Reasoning","primary_cat":"cs.CV","submitted_at":"2026-05-14T11:12:23+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"EponaV2 advances perception-free driving world models by forecasting comprehensive future 3D geometry and semantic representations, achieving SOTA planning performance on NAVSIM benchmarks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Towards real-world-drive world models for autonomous driving. In European conference on computer vision, pages 55-72. Springer, 2024. [74] Xinyang Wang, Qian Liu, Wenjie Ding, Zhao Yang, Wei Li, Chang Liu, Bailin Li, Kun Zhan, Xianpeng Lang, and Wei Chen. Unifying language-action understanding and generation for autonomous driving. arXiv preprint arXiv:2603.01441, 2026. [75] Yuqi Wang, Xinghang Li, Wenxuan Wang, Junbo Zhang, Yingyan Li, Yuntao Chen, Xinlong Wang, and Zhaoxiang Zhang. Unified vision-language-action model. arXiv preprint arXiv:2506.19850, 2025. [76] Yin Wei, Zhang Chi, Chen Hao, Cai Zhipeng, Yu Gang, Wang Kaixuan, Chen Xiaozhi, and Shen Chunhua. 
Metric3D: Towards zero-shot metric 3D prediction from a single image."},{"citing_arxiv_id":"2605.13632","ref_index":32,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Guide, Think, Act: Interactive Embodied Reasoning in Vision-Language-Action Models","primary_cat":"cs.RO","submitted_at":"2026-05-13T14:58:29+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"GTA-VLA conditions VLA models on user spatial priors to produce a unified spatial-visual chain-of-thought, reaching 81.2% success on SimplerEnv WidowX and improving performance under out-of-distribution shifts.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"In this way, the policy preserves rich reasoning capacity without requiring autoregressive VLM decoding at every control step. 3.5 Data Construction and Training Recipe To train the framework at scale, we construct Interact-306K, a multi-embodiment dataset for guided spatial reasoning. As shown in Fig. 3, it is built from approximately 306K real-world manipulation trajectories collected from Open X-Embodiment (OXE) [25], DROID [16], RoboMind [32], and our own data, and augmented with automatically generated spatial-reasoning annotations. Automated Spatial-CoT Supervision. Since raw robot demonstrations do not contain explicit reasoning traces, we automatically construct supervision for both the Guide and Think phases. 
For each trajectory, we generate a structured reasoning target C = [C_{task}, C_{vision}, C_{robot}], (10)"},{"citing_arxiv_id":"2605.06481","ref_index":79,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"OA-WAM: Object-Addressable World Action Model for Robust Robot Manipulation","primary_cat":"cs.RO","submitted_at":"2026-05-07T16:06:08+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"OA-WAM uses persistent address vectors and dynamic content vectors in object slots to enable addressable world-action prediction, improving robustness on manipulation benchmarks under scene changes.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"VGGT: Visual geometry grounded transformer. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025. arXiv:2503.11651. [78] Junke Wang, Yi Jiang, Zehuan Yuan, Binyue Peng, Zuxuan Wu, and Yu-Gang Jiang. OmniTokenizer: A joint image-video tokenizer for visual generation. In Advances in Neural Information Processing Systems (NeurIPS), 2024. arXiv:2406.09399. [79] Yuqi Wang, Xinghang Li, Wenxuan Wang, Junbo Zhang, Yingyan Li, Yuntao Chen, Xinlong Wang, and Zhaoxiang Zhang. Unified Vision-Language-Action model. arXiv preprint arXiv:2506.19850, 2025. [80] Bowen Wen, Wei Yang, Jan Kautz, and Stan Birchfield. FoundationPose: Unified 6D pose estimation and tracking of novel objects. 
In IEEE/CVF Conference on Computer Vision and Pattern Recognition"},{"citing_arxiv_id":"2605.08215","ref_index":3,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Test-Time Training for Visual Foresight Vision-Language-Action Models","primary_cat":"cs.CV","submitted_at":"2026-05-06T11:21:25+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"T³VF applies test-time training on natural future-prediction supervision pairs with adaptive filtering to mitigate OOD shifts in VF-VLA models at modest extra inference cost.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.03269","ref_index":105,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"RLDX-1 Technical Report","primary_cat":"cs.RO","submitted_at":"2026-05-05T01:40:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"RLDX-1 outperforms frontier VLAs such as π0.5 and GR00T N1.6 on dexterous manipulation benchmarks, reaching 86.8% success on ALLEX humanoid tasks versus around 40% for the baselines.","context_count":1,"top_context_role":"background","top_context_polarity":"unclear","context_text":"He, Vivek Myers, Moo Jin Kim, Max Du, et al. Bridgedata v2: A dataset for robot learning at scale. In Conference on Robot Learning, 2023. [104] Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314, 2025a. [105] Yuqi Wang, Xinghang Li, Wenxuan Wang, Junbo Zhang, Yingyan Li, Yuntao Chen, Xinlong Wang, and Zhaoxiang Zhang. Unified vision-language-action model. arXiv preprint arXiv:2506.19850, 2025b. [106] Chuan Wen, Jierui Lin, Trevor Darrell, Dinesh Jayaraman, and Yang Gao. 
Fighting copycat agents in behavioral cloning from observation histories. In Advances in Neural Information Processing Systems, 2020."},{"citing_arxiv_id":"2605.00438","ref_index":24,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Thinking in Text and Images: Interleaved Vision--Language Reasoning Traces for Long-Horizon Robot Manipulation","primary_cat":"cs.AI","submitted_at":"2026-05-01T06:15:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"A multimodal transformer generates and caches interleaved text-image traces to guide closed-loop actions, achieving 92.4% success on LIBERO-Long and 95.5% average on LIBERO.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"the intended task sequence rather than only by short-term visual plausibility. Unified multimodal generation. Native multimodal models such as Chameleon, Transfusion, and Show-o/Show-o2 unify understanding and generation across text and images in a single transformer [5, 32, 26, 27]. Robotics work has begun to adapt this idea to action generation, including UniVLA, dVLA, EO-1, and related unified policies [24, 25, 20]. The unified formulation is useful for IVLR because the same model can express language tokens, visual keyframes, and action-conditioned context without handing off between separate planners and controllers. Our contribution is not the unified backbone itself. 
We use such a backbone to study a specific robot reasoning representation: a full-horizon interleaved trace that is generated before execution and then cached for"},{"citing_arxiv_id":"2604.26694","ref_index":39,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Unified 4D World Action Modeling from Video Priors with Asynchronous Denoising","primary_cat":"cs.RO","submitted_at":"2026-04-29T14:01:54+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"X-WAM unifies robotic action execution and 4D world synthesis by adapting video diffusion priors with a lightweight depth branch and asynchronous noise sampling, achieving 79-91% success on robot benchmarks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Worldvla: Towards autoregressive action world model. CoRR, abs/2506.21539, 2025. [38] Wenyao Zhang, Hongsi Liu, Zekun Qi, Yunnan Wang, Xinqiang Yu, Jiazhao Zhang, Runpei Dong, Jiawei He, He Wang, Zhizheng Zhang, Li Yi, Wenjun Zeng, and Xin Jin. Dreamvla: A vision-language-action model dreamed with comprehensive world knowledge. CoRR, abs/2507.04447, 2025. [39] Yuqi Wang, Xinghang Li, Wenxuan Wang, Junbo Zhang, Yingyan Li, Yuntao Chen, Xinlong Wang, and Zhaoxiang Zhang. Unified vision-language-action model. CoRR, abs/2506.19850, 2025. [40] Jingwen Sun, Wenyao Zhang, Zekun Qi, Shaojie Ren, Zezhi Liu, Hanxin Zhu, Guangzhong Sun, Xin Jin, and Zhibo Chen. 
VLA-JEPA: enhancing vision-language-action model with latent"},{"citing_arxiv_id":"2604.24622","ref_index":43,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"CF-VLA: Efficient Coarse-to-Fine Action Generation for Vision-Language-Action Policies","primary_cat":"cs.CV","submitted_at":"2026-04-27T15:51:40+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"CF-VLA uses a coarse initialization over endpoint velocity followed by single-step refinement to achieve strong performance with low inference steps on CALVIN, LIBERO, and real-robot tasks.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"Other labels summarize auxiliary mechanisms such as Plan, WM, 3D spatial input, and Distill. Avg. Len. denotes the average sequence length. ∗ indicates results reproduced by us. Method Venue, Year Aux. Mech. NFE 1 2 3 4 5 Avg. Len. NFE = 1 methods DeeR [45] NeurIPS, 2024 - 1 85.3 69.6 54.9 42.0 31.2 2.83 LCD [46] ICLR, 2024 Plan 1 88.7 69.9 54.5 42.7 32.2 2.88 DySL-VLA [43] DAC, 2026 - 1 89.4 71.9 53.9 42.0 32.0 2.89 TaKSIE [13] WACV, 2025 Plan 1 90.4 73.9 61.7 51.2 40.8 3.18 HULC++ [26] ICRA, 2023 Plan 1 93.0 79.0 64.0 52.0 40.0 3.30 RoboTron-Mani [40] ICCV, 2025 3D input 1 94.7 80.3 65.1 51.4 39.0 3.31 DaDu-Corki-SW [9] ISCA, 2025 - 1 92.3 80.0 67.4 56.6 45.8 3.42 RoboUniView (default) [23] arXiv, 2024 - 195.4 82.768."},{"citing_arxiv_id":"2603.00110","ref_index":56,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Learning Physics from Pretrained Video Models: A Multimodal Continuous and Sequential World Interaction Models for Robotic Manipulation","primary_cat":"cs.RO","submitted_at":"2026-02-18T14:58:18+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"PhysGen uses video models to learn physics for robots, outperforming baselines 
by up to 13.8% on Libero and matching specialized models in real-world tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2601.21998","ref_index":78,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Causal World Modeling for Robot Control","primary_cat":"cs.CV","submitted_at":"2026-01-29T17:07:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"LingBot-VA combines video world modeling with policy learning via Mixture-of-Transformers, closed-loop rollouts, and asynchronous inference to improve robot manipulation in simulation and real settings.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"9 84.4 SmolVLA [67] 93.0 94.0 91.0 77.0 88.8 CronusVLA [37] 97.3 99.6 96.9 94.0 97.0 FLOWER [62] 97.1 96.7 95.6 93.5 95.7 GR00T-N1 [6] 94.4 97.6 93.0 90.6 93.9 π0 [7] 96.8 98.8 95.8 85.2 94.1 π0+FAST [57] 96.4 96.8 88.6 60.2 85.5 OpenVLA [34] 84.7 88.4 79.2 53.7 76.5 OpenVLA-OFT [32] 97.6 98.4 97.9 94.5 97.1 DD-VLA [44] 97.2 98.6 97.4 92.0 96.3 UniVLA [78] 95.4 98.8 93.6 94.0 95.4 X-VLA [93] 98.2 98.6 97.8 97.6 98.1 LingBot-VA (Ours) 98.5±0.3 99.6±0.3 97.2±0.2 98.5±0.5 98.5 two manipulators, making it significantly more difficult for policy learning. We evaluate under both Easy (fixed initial configurations) and Hard (varied object poses and scene layouts) settings. As shown in Tab. 
1, LingBot-VA achieves"},{"citing_arxiv_id":"2512.09928","ref_index":38,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"HiF-VLA: Hindsight, Insight and Foresight through Motion Representation for Vision-Language-Action Models","primary_cat":"cs.RO","submitted_at":"2025-12-10T18:59:32+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"HiF-VLA improves long-horizon robotic manipulation by encoding past motion as hindsight priors and anticipating future motion through foresight reasoning inside a VLA framework.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2511.18960","ref_index":50,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"AVA-VLA: Improving Vision-Language-Action models with Active Visual Attention","primary_cat":"cs.LG","submitted_at":"2025-11-24T10:22:28+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"AVA-VLA reformulates VLA learning as a POMDP using recurrent states and active visual attention to achieve state-of-the-art results on LIBERO, CALVIN, and real dual-arm tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2510.13778","ref_index":39,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"InternVLA-M1: A Spatially Guided Vision-Language-Action Framework for Generalist Robot Policy","primary_cat":"cs.RO","submitted_at":"2025-10-15T17:30:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"InternVLA-M1 uses spatially guided pre-training on 2.3M examples followed by action post-training to deliver up to 17% gains on robot manipulation benchmarks and 20.6% on unseen 
objects.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2509.06951","ref_index":26,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"F1: A Vision-Language-Action Model Bridging Understanding and Generation to Actions","primary_cat":"cs.RO","submitted_at":"2025-09-08T17:58:30+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"F1 integrates next-scale visual foresight prediction into a Mixture-of-Transformer VLA architecture to reformulate action generation as foresight-guided inverse dynamics, achieving higher success rates on 136 tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2508.13073","ref_index":107,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Large VLM-based Vision-Language-Action Models for Robotic Manipulation: A Survey","primary_cat":"cs.RO","submitted_at":"2025-08-18T16:45:48+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"This survey organizes large VLM-based VLA models for robotic manipulation into monolithic and hierarchical paradigms, reviews their integrations and datasets, and outlines future directions.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"5-7B AD (A), SFT (B) Introduce four open-ended tasks to expand interaction modalities. ReFineVLA [105] SigLIP Gemma2 AD (A), SFT (B) Propose reasoning-aware framework to fine-tune VLAs effectively. LoHoVLA [106] SigLIP Gemma-2B AD (A), SFT (B) Address long-horizon tasks via hierarchical closed-loop control. BridgeVLA [35] SigLIP Gemma PD (A), SFT (B) Project 3D data into 2D space for efficient action prediction UnifiedVLA [107] - Emu3 AD (A), SFT (B) Convert all input signals into tokens to build a unified model. 
WorldVLA [38] - Chameleon AD (A), SFT (B) Combine world and action models for bidirectional improvement. 4D-VLA [108] - InternVL-4B PD (A) Integrate 4D spatiotemporal cues for efficient VLA pretraining. VOTE [109] DINOv2 + SigLIP LLaMA2-7B PD (A) Introduce voting strategy to increase action prediction accuracy."},{"citing_arxiv_id":"2405.14093","ref_index":129,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"A Survey on Vision-Language-Action Models for Embodied AI","primary_cat":"cs.RO","submitted_at":"2024-05-23T01:43:54+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"This is the first survey on vision-language-action models, providing a taxonomy across three lines, plus summaries of datasets, simulators, benchmarks, challenges, and future directions in embodied AI.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"introduces an inpainting-based semantic augmentation method. RoboFlamingo [102] adapts the OpenFlamingo VLM to a robot policy by attaching an LSTM-based policy head. This demonstrates that pretrained VLMs can be effectively transferred to language-conditioned robotic manipulation tasks. A recent trend in LLMs is equipping them with tool-use capabilities by generating code that calls tools via APIs [129]. Instruct2Act [105] follows this paradigm by integrating vision and action tools, enabling LLMs to perform robotic tasks. 3) Control Policies for Multimodal Instructions: Multimodal instruction enables new ways to specify tasks, such as through demonstrations, by naming novel objects, or by pointing with a finger or mouse click. VIMA [130] places a significant emphasis on multimodal"}],"limit":50,"offset":0}