{"total":27,"items":[{"citing_arxiv_id":"2605.23856","ref_index":64,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Point Tracking Improves World Action Models","primary_cat":"cs.RO","submitted_at":"2026-05-22T17:08:37+00:00","verdict":"UNVERDICTED","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"JOPAT jointly models pixels, point tracks, and actions in a diffusion transformer and reports gains over pixel-only baselines on long-horizon robot tasks with occlusion and off-screen motion.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.15735","ref_index":12,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"UAM: A Dual-Stream Perspective on Forgetting in VLA Training","primary_cat":"cs.CV","submitted_at":"2026-05-15T08:45:37+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"UAM adds a Dorsal Expert initialized from a generative model and trained on visual dynamics prediction to preserve over 95% of VLM multimodal ability in VLA training while achieving top success rates on manipulation tasks including OOD cases.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.16412","ref_index":32,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"SCAR: Self-Supervised Continuous Action Representation Learning","primary_cat":"cs.RO","submitted_at":"2026-05-13T16:23:11+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SCAR proposes a joint inverse-forward dynamics framework to learn transferable continuous action representations across embodiments from visual data using regularization and adversarial invariance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.13925","ref_index":114,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Towards Robotic Dexterous Hand Intelligence: A Survey","primary_cat":"cs.RO","submitted_at":"2026-05-13T15:23:17+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"A structured survey of dexterous robotic hand research that reviews hardware, control methods, data resources, and benchmarks while identifying major limitations and future directions.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"data sources by incorporating Internet-scale human interaction data into the VLA framework [111]. Recent studies extend VLA toward stronger geometric grounding and hierarchical control, which are critical for in-hand manipulation. 3D-VLA [112] augments 2D visual inputs with 3D scene representations, enabling reasoning about hand-object geometry and contact. RoboDexVLM [113] and Villa-X [114] decompose long-horizon instructions into structured subgoals, marking a shift from end-to-end language policies toward hierarchical architectures that interface more naturally with low-level controllers. DexVLA [115] introduces plug-in diffusion experts to handle multimodal coordination during reorientation, while OmniVLA [116] unifies visual, linguistic, and haptic inputs within a single Transformer ar-"},{"citing_arxiv_id":"2605.13452","ref_index":4,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"CUBic: Coordinated Unified Bimanual Perception and Control Framework","primary_cat":"cs.RO","submitted_at":"2026-05-13T12:48:23+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"CUBic learns a shared tokenized representation for bimanual robot perception and control via unidirectional aggregation, bidirectional codebook coordination, and a unified diffusion policy, yielding higher coordination accuracy and task success on the RoboTwin benchmark.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.13403","ref_index":32,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"RotVLA: Rotational Latent Action for Vision-Language-Action Model","primary_cat":"cs.RO","submitted_at":"2026-05-13T11:58:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"RotVLA models latent actions as continuous SO(n) rotations with triplet-frame supervision and flow-matching to reach 98.2% success on LIBERO and 89.6%/88.5% on RoboTwin2.0 using a 1.7B-parameter model.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"a unified action space across heterogeneous datasets by encoding the transition between sequential observations. Mainstream approaches [6, 27-31] typically adopt an Inverse Dynamics Model (IDM) 3 to infer latent actions from consecutive video frames, coupled with a Forward Dynamics Model (FDM) that reconstructs future observations conditioned on the inferred latent action. Building on this formulation, [32, 33] incorporate annotated actions to guide latent-action learning, while [22, 34] treat latent actions as surrogate labels for unlabeled data. Other works enhance these models by introducing additional modalities such as depth or optical flow [34-37]. Despite their promise, these pipelines have the risk of degenerating into trivial solutions that simply"},{"citing_arxiv_id":"2605.12334","ref_index":6,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Reinforcing VLAs in Task-Agnostic World Models","primary_cat":"cs.AI","submitted_at":"2026-05-12T16:16:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"RAW-Dream disentangles world-model learning from task data by using a pre-trained task-agnostic world model and VLM rewards, with dual-noise filtering, to enable zero-shot VLA adaptation in simulation and real settings.","context_count":1,"top_context_role":"other","top_context_polarity":"unclear","context_text":"π0: A vision-language-action flow model for general robot control.arXiv preprint arXiv:2410.24164, 2024. [4] Chandra, A.L. et al. Diwa: Diffusion policy adaptation with world models.arXiv preprint arXiv:2508.03645, 2025. [5] Chen, K. et al. πRL: Online rl fine-tuning for flow-based vision-language-action models.arXiv preprint arXiv: 2510.25889, 2025. [6] Chen, X. et al. Villa-x: enhancing latent action modeling in vision-language-action models. arXiv preprint arXiv:2507.23682, 2025. [7] Collaboration, O.X.E. et al. Open X-Embodiment: Robotic learning datasets and RT-X models. https://arxiv.org/abs/2310.08864, 2023. [8] Guo, Y . et al. Vlaw: Iterative co-improvement of vision-language-action policy and world"},{"citing_arxiv_id":"2605.12090","ref_index":89,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"World Action Models: The Next Frontier in Embodied AI","primary_cat":"cs.RO","submitted_at":"2026-05-12T13:10:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"The paper introduces World Action Models as a new paradigm unifying predictive world modeling with action generation in embodied foundation models and provides a taxonomy of existing approaches.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Say ,Dream,and Act [10], Gen2Act [68], A VDC [8], Im2Flow2Act [69], 3DFlowAction [70] NovaFlow [71], Dream2Flow [72], Dreamitate [ 73], 4DGen [ 74], RIGVid [75], L VP [76] Vidar [77], Veo-Act [78], pi0.7 [ 79], V AG [80] Implicit VPP [11], VILP [ 81], Video Policy [13], ARDuP [ 82], mimic-video [ 12], LAP A [15], villa-X [ 83], S-V AM [14], OmniVTA [84], MWM [85] Joint W AM Autoregression GR1 [86], grmg [ 87], GR2 [88], Co TVLA [89], WorldVLA [90], rynnvla2 [91] VLA-JEP A [92], F1-VLA [93] Diffusion-based P AD [21], VideoVLA [94], UWM [20], DreamZero [ 17], CosmosPolicy [16], FLARE [95], UV A [96] FRAPPE [97], CoV AR [98], LDA1B [99], W A V [100], DUST [101], LingBotV A [18], AIM [ 102] DexWorldModel [103], FastW AM [104], MotuBrain [105] AdaWorldPolicy [106], DiT4DiT [107],"},{"citing_arxiv_id":"2605.10821","ref_index":8,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Unified Noise Steering for Efficient Human-Guided VLA Adaptation","primary_cat":"cs.RO","submitted_at":"2026-05-11T16:37:34+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"UniSteer unifies human corrective actions and noise-space RL for VLA adaptation by inverting actions to noise targets, raising success rates from 20% to 90% in 66 minutes across four real-world manipulation tasks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"into language-model-style sequence modeling [1-4, 10, 13, 18]. Other methods attach continuous action heads or diffusion-style decoders to vision-language models for better representation of high-dimensional continuous control [ 5, 17, 45, 46]. More recently, many VLA policies adopt flow-matching action heads, showing strong generative capability and promising performance in real- world robotic manipulation [8, 12, 14-16, 19]. These policies generate action chunks by transporting 2 initial noise variables to continuous actions through a learned state-conditioned velocity field. Our work focuses on this class of flow-matching VLA policies due to their competitive performance and leverages the initial noise variable as a natural interface for policy steering."},{"citing_arxiv_id":"2605.10819","ref_index":13,"ref_count":2,"confidence":0.9,"is_internal_anchor":true,"paper_title":"ALAM: Algebraically Consistent Latent Action Model for Vision-Language-Action Models","primary_cat":"cs.RO","submitted_at":"2026-05-11T16:37:07+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ALAM introduces algebraic consistency regularization on latent action transitions from videos, raising VLA success rates from 47.9% to 85.0% on MetaWorld MT50 and 94.1% to 98.1% on LIBERO.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.07381","ref_index":90,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Escaping the Diversity Trap in Robotic Manipulation via Anchor-Centric Adaptation","primary_cat":"cs.RO","submitted_at":"2026-05-08T07:35:24+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Anchor-Centric Adaptation escapes the diversity trap by prioritizing repeated demonstrations at core anchors over broad coverage, yielding higher success rates under fixed data budgets in robotic manipulation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.06175","ref_index":4,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"VLA-GSE: Boosting Parameter-Efficient Fine-Tuning in VLA with Generalized and Specialized Experts","primary_cat":"cs.RO","submitted_at":"2026-05-07T12:56:58+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"VLA-GSE uses spectral decomposition of the VLA backbone to create generalized and specialized experts, enabling effective robot task adaptation while updating only 2.51% of parameters and achieving 81.2% zero-shot success on LIBERO-Plus.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.04678","ref_index":15,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"From Pixels to Tokens: A Systematic Study of Latent Action Supervision for Vision-Language-Action Models","primary_cat":"cs.RO","submitted_at":"2026-05-06T09:27:07+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A unified comparison of latent action supervision strategies for VLA models reveals task-specific benefits, with image-based approaches aiding reasoning and generalization, action-based aiding motor control, and discrete tokens proving most effective.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Specifically, we attach a multi-layer perceptron (MLP) on top of the pre-quantization latent representation cimg t , which predicts the corresponding low-level action. For a randomly sampled subset of 5% of the training data with available ground-truth actionsa ∗ t , we optimize an additional action regression loss: Limg act =∥ˆat −a ∗ t ∥2 2 .(14) The full training objective of the image-based latent action model becomes: L=L img +λ actLimg act ,(15) whereλ act balances the auxiliary action supervision, we useλ act = 1.0. To verify the effectiveness of this design choice, we retrain the image-based latent action model without action supervision and evaluate the three image-based strategies on LIBERO. As shown in Tab. 6, all three variants still outperform the baseline (93.1% A VG), and the relative ordering among strategies is preserved; however, a consistent performance drop is observed"},{"citing_arxiv_id":"2605.03269","ref_index":25,"ref_count":2,"confidence":0.9,"is_internal_anchor":true,"paper_title":"RLDX-1 Technical Report","primary_cat":"cs.RO","submitted_at":"2026-05-05T01:40:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"RLDX-1 outperforms frontier VLAs such as π0.5 and GR00T N1.6 on dexterous manipulation benchmarks, reaching 86.8% success on ALLEX humanoid tasks versus around 40% for the baselines.","context_count":1,"top_context_role":"background","top_context_polarity":"unclear","context_text":"[23] Huayu Chen, Cheng Lu, Chengyang Ying, Hang Su, and Jun Zhu. Offline reinforcement learning via high-fidelity generative behavior modeling. InInternational Conference on Learning Representations, 2023. [24] William Chen, Suneel Belkhale, Suvir Mirchandani, Oier Mees, Danny Driess, Karl Pertsch, and Sergey Levine. Training strategies for efficient embodied reasoning. InConference on Robot Learning, 2025b. [25] Xiaoyu Chen, Hangxing Wei, Pushi Zhang, Chuheng Zhang, Kaixin Wang, Yanjiang Guo, Rushuai Yang, Yucen Wang, Xinquan Xiao, Li Zhao, et al. Villa-x: enhancing latent action modeling in vision-language-action models.arXiv preprint arXiv:2507.23682, 2025c. [26] Zhengxue Cheng, Yiqian Zhang, Wenkang Zhang, Haoyu Li, Keyu Wang, Li Song, and Hengdi Zhang."},{"citing_arxiv_id":"2605.00078","ref_index":84,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Being-H0.7: A Latent World-Action Model from Egocentric Videos","primary_cat":"cs.RO","submitted_at":"2026-04-30T14:16:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Being-H0.7 adds future-aware latent reasoning to direct VLA policies via dual-branch alignment on latent queries, matching world-model benefits at VLA efficiency.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Towards universal visual reward and representation via value-implicit pre-training. InThe Eleventh International Conference on Learning Representations, 2022. [83] Jiange Yang, Yansong Shi, Haoyi Zhu, Mingyu Liu, Kaijing Ma, Yating Wang, Gangshan Wu, Tong He, and Limin Wang. Como: Learning continuous latent motion from internet videos for scalable robot learning.arXiv preprint arXiv:2505.17006, 2025. [84] Xiaoyu Chen, Hangxing Wei, Pushi Zhang, Chuheng Zhang, Kaixin Wang, Yanjiang Guo, Rushuai Yang, Yucen Wang, Xinquan Xiao, Li Zhao, et al. Villa-x: enhancing latent action modeling in vision-language-action models. arXiv preprint arXiv:2507.23682, 2025. [85] Hao Luo, Ye Wang, Wanpeng Zhang, Haoqi Yuan, Yicheng Feng, Haiweng Xu, Sipeng Zheng, and Zongqing Lu."},{"citing_arxiv_id":"2604.22615","ref_index":15,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"GazeVLA: Learning Human Intention for Robotic Manipulation","primary_cat":"cs.RO","submitted_at":"2026-04-24T14:46:03+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"GazeVLA pretrains on large human egocentric datasets to capture gaze-based intention, then finetunes on limited robot data with chain-of-thought reasoning to achieve better robotic manipulation performance than baselines.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.19734","ref_index":23,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"UniT: Toward a Unified Physical Language for Human-to-Humanoid Policy Learning and World Modeling","primary_cat":"cs.RO","submitted_at":"2026-04-21T17:57:27+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"UniT creates a unified physical language via visual anchoring and tri-branch reconstruction to enable scalable human-to-humanoid transfer for policy learning and world modeling.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Moto [9] and LAPA [10] learn latent motion tokens from video, while UniVLA [ 11] uses vision-derived latent actions for cross- embodiment policy learning. While this offers cross-domain potential, such representations tend to entangle low-level appearance factors and miss fine-grained motor detail, underexploiting the structural priors available in human pose data. Villa-X [23] partially addresses this by incorporating action reconstruction as an auxiliary target, but the unidirectional vision-to-action objective still limits the precision of the learned motor representation. Concurrent works such as METIS [13] and XR-1 [24] takeboth vision and action as encoder inputsbut have not achieved explicit vision-action alignment."},{"citing_arxiv_id":"2604.12908","ref_index":14,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Robotic Manipulation is Vision-to-Geometry Mapping ($f(v) \\rightarrow G$): Vision-Geometry Backbones over Language and Video Models","primary_cat":"cs.RO","submitted_at":"2026-04-14T15:57:16+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Vision-geometry backbones using pretrained 3D world models outperform vision-language and video models for robotic manipulation by enabling direct mapping from visual input to geometric actions.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.04502","ref_index":8,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Veo-Act: How Far Can Frontier Video Models Advance Generalizable Robot Manipulation?","primary_cat":"cs.RO","submitted_at":"2026-04-06T07:57:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Veo-3 video predictions enable approximate task-level robot trajectories in zero-shot settings but require hierarchical integration with low-level VLA policies for reliable manipulation performance.","context_count":1,"top_context_role":"other","top_context_polarity":"unclear","context_text":"Jitendra Malik, Pieter Abbeel, Russ Tedrake, Vincent Sitzmann, and Yilun Du. Large video planner enables generalizable robot control, 2025. URL https://arxiv.org/ abs/2512.15840. [7] Feng Chen, Zhuxiu Xu, Tianzhe Chu, Xunzhe Zhou, Li Sun, Zewen Wu, Shenghua Gao, Zhongyu Li, Yanchao Yang, and Yi Ma. Gendexhand: Generative simulation for dexterous hands.arXiv preprint arXiv:2511.01791, 2025. [8] Xiaoyu Chen, Hangxing Wei, Pushi Zhang, Chuheng Zhang, Kaixin Wang, Yanjiang Guo, Rushuai Yang, Yucen Wang, Xinquan Xiao, Li Zhao, et al. Villa- x: enhancing latent action modeling in vision-language- action models.arXiv preprint arXiv:2507.23682, 2025. [9] Yufan Deng, Zilin Pan, Hongyu Zhang, Xiaojie Li, Ruoqing Hu, Yufei Ding, Yiming Zou, Yan Zeng, and"},{"citing_arxiv_id":"2604.03340","ref_index":1,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Learning Additively Compositional Latent Actions for Embodied AI","primary_cat":"cs.CV","submitted_at":"2026-04-03T08:53:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"AC-LAM enforces additive composition on latent actions from visual transitions, yielding more structured and calibrated motion latents that improve downstream embodied policy learning over prior LAMs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2602.20231","ref_index":35,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"UniLACT: Depth-Aware RGB Latent Action Learning for Vision-Language-Action Models","primary_cat":"cs.RO","submitted_at":"2026-02-23T18:41:41+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"UniLACT improves VLA models by adding depth-aware unified latent action pretraining that outperforms RGB-only baselines on seen and unseen manipulation tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2603.00110","ref_index":12,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Learning Physics from Pretrained Video Models: A Multimodal Continuous and Sequential World Interaction Models for Robotic Manipulation","primary_cat":"cs.RO","submitted_at":"2026-02-18T14:58:18+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"PhysGen uses video models to learn physics for robots, outperforming baselines by up to 13.8% on Libero and matching specialized models in real-world tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2602.06949","ref_index":17,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"DreamDojo: A Generalist Robot World Model from Large-Scale Human Videos","primary_cat":"cs.RO","submitted_at":"2026-02-06T18:49:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"DreamDojo is a foundation world model pretrained on the largest human video dataset to date that uses continuous latent actions to transfer interaction knowledge and achieves controllable physics simulation after robot post-training.","context_count":1,"top_context_role":"background","top_context_polarity":"unclear","context_text":"Large Video Planner Enables Generalizable Robot Control.arXiv preprint arXiv:2512.15840, 2025. 4 [16] Xiaoyu Chen, Junliang Guo, Tianyu He, Chuheng Zhang, Pushi Zhang, Derek Cathera Yang, Li Zhao, and Jiang Bian. IGOR: Image-GOal Representations are the Atomic Control Units for Foundation Models in Embodied AI.arXiv preprint arXiv:2411.00785, 2024. 16 [17] Xiaoyu Chen, Hangxing Wei, Pushi Zhang, Chuheng Zhang, Kaixin Wang, Yanjiang Guo, Rushuai Yang, Yucen Wang, Xinquan Xiao, Li Zhao, et al. villa-X: Enhancing Latent Action Modeling in Vision-Language- Action Models.arXiv preprint arXiv:2507.23682, 2025. 4 [18] Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon,"},{"citing_arxiv_id":"2601.07060","ref_index":18,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"PALM: Progress-Aware Policy Learning via Affordance Reasoning for Long-Horizon Robotic Manipulation","primary_cat":"cs.RO","submitted_at":"2026-01-11T21:00:58+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"PALM improves long-horizon robotic manipulation success by distilling affordance representations for object interaction and predicting within-subtask progress in a VLA model.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"For example, on \"clean a cluttered table,\" state-of-the-art policies typically succeed initially but fail mid-task, unable to reliably complete the full sequence. A fundamental limitation is the absence of structured af- fordance cues [39, 42, 58, 107, 135] and explicit state track- ing [16, 50]. Although existing models may infer the final goal and produce intermediate actions [18, 38, 112, 143, 146, 148], they lack internal representations that disambiguate which object should be targeted next, which part or region is relevant for interaction, where items should be placed or moved, or what motion is appropriate for the upcoming step. Consequently, many visually similar states become ambiguous, obscuring the underlying task stage and destabi-"},{"citing_arxiv_id":"2512.16811","ref_index":7,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"GeoPredict: Leveraging Predictive Kinematics and 3D Gaussian Geometry for Precise VLA Manipulation","primary_cat":"cs.CV","submitted_at":"2025-12-18T17:51:42+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"GeoPredict improves VLA manipulation accuracy by adding predictive kinematic trajectories and 3D Gaussian workspace geometry as training-time depth-rendering supervision.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2510.26433","ref_index":6,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Co-Evolving Latent Action World Models","primary_cat":"cs.LG","submitted_at":"2025-10-30T12:28:40+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"CoLA-World jointly trains latent action models and world models with a warm-up phase to achieve co-evolution, matching or exceeding prior two-stage methods in video simulation quality and visual planning performance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2508.13073","ref_index":142,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Large VLM-based Vision-Language-Action Models for Robotic Manipulation: A Survey","primary_cat":"cs.RO","submitted_at":"2025-08-18T16:45:48+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"This survey organizes large VLM-based VLA models for robotic manipulation into monolithic and hierarchical paradigms, reviews their integrations and datasets, and outlines future directions.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"SmolVLA [31] SmolVLM-2 FM Propose a lightweight VLA with frozen SmolVLM-2 and flow-matching transformer. OneTwoVLA [139] π0 FM Integrate acting/reasoning in shared VLA backbone processing multi-view inputs. Tactile-VLA [140] π0 FM Integrate tactile sensing to enable force-aware, generalizable contact-rich manipulation. GR-3 [141] Qwen2.5-VL FM Combine VL data and few-shot trajectories for robust manipulation in long-horizon or unseen tasks. villa-X [142] PaliGemma FM Integrate proprioceptively grounded latent actions and robot actions in a joint diffusion process. GraspVLA [143] InternLM2 FM Enable sim-to-real and open-vocabulary grasping with synthetic data. and open-vocabulary grasping. villa-X [142] grounds latent actions in robot states and jointly models latent and robot actions via joint diffusion for structured vision-action in-"}],"limit":50,"offset":0}