{"total":31,"items":[{"citing_arxiv_id":"2606.30613","ref_index":26,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Sequential Planning via Anchored Robotic Keypoints","primary_cat":"cs.RO","submitted_at":"2026-06-29T17:48:01+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SPARK reaches 43.7% success on six LIBERO-PRO cells by LLM-generated typed behavior trees plus multi-prompt perception and recovery, more than doubling CaP-Agent0 and VLA baselines.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.30111","ref_index":17,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Automating the Design of Embodied AgentArchitectures","primary_cat":"cs.RO","submitted_at":"2026-06-29T10:45:37+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Automated architecture search for embodied agents produces directional success-rate gains on vision-language and manipulation tasks while exposing limits from simulation noise and incomplete credit assignment.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.29267","ref_index":8,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Enhancing Part-Level Point Grounding for Any Open-Source MLLMs","primary_cat":"cs.CV","submitted_at":"2026-06-28T08:32:35+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A plug-in Q-Synth Module plus Attention-to-Point Decoder converts text-conditioned attention in frozen MLLMs into point heatmaps, improving part-level grounding accuracy on multiple 
datasets.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.26800","ref_index":43,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"SSI-Policy: Learning Structured Scene Interfaces for Vision-Language Robotic Manipulation","primary_cat":"cs.RO","submitted_at":"2026-06-25T09:38:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SSI-Policy uses an RGB-only Structured Scene Interface to improve LIBERO benchmark performance by nearly 15% with only 10 demonstrations per task compared to prior methods.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.23856","ref_index":52,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Point Tracking Improves World Action Models","primary_cat":"cs.RO","submitted_at":"2026-05-22T17:08:37+00:00","verdict":"UNVERDICTED","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"JOPAT jointly models pixels, point tracks, and actions in a diffusion transformer and reports gains over pixel-only baselines on long-horizon robot tasks with occlusion and off-screen motion.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.22183","ref_index":40,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Action with Visual Primitives","primary_cat":"cs.RO","submitted_at":"2026-05-21T08:52:47+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"AVP architecture has VLM emit visual-primitive tokens to condition flow-matching action expert, yielding 27.61% higher success rate than pi_0.5 on real-robot pick-and-place 
tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.14274","ref_index":17,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"CreFlow: Corrective Reflow for Sparse-Reward Embodied Video Diffusion RL","primary_cat":"cs.CV","submitted_at":"2026-05-14T02:18:58+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"CreFlow combines LTL compositional rewards with credit-aware NFT and corrective reflow losses in online RL to improve embodied video diffusion models, raising downstream task success by 23.8 percentage points on eight bimanual manipulation tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.11951","ref_index":27,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"From Reaction to Anticipation: Proactive Failure Recovery through Agentic Task Graph for Robotic Manipulation","primary_cat":"cs.RO","submitted_at":"2026-05-12T11:00:45+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"AgentChord models manipulation tasks as directed graphs enriched with anticipatory recovery branches, using specialized agents to enable immediate, low-latency failure responses and improve success on long-horizon bimanual tasks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"However, directly deploying MLLMs introduces two key challenges [55]: (i) slow inference caused by local computation and API-call latency, and (ii) imprecise grounding, as language-based instructions are coarse-grained without capturing details such as object positions, spatial relationships, exact angles, or distances. 
Striving for timely and accurate guidance, recent studies [27, 55] have leveraged MLLMs to extract key constraints, compile them into executable monitoring code, and verify numerical conditions during execution. Although this approach minimizes the need for frequent VQA calls and enables real-time monitoring, the generated monitoring code is often overly simplistic and limited to a small set of hand-crafted error"},{"citing_arxiv_id":"2605.11144","ref_index":3,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Forecast-aware Gaussian Splatting for Predictive 3D Representation in Language-Guided Pick-and-Place Manipulation","primary_cat":"cs.RO","submitted_at":"2026-05-11T18:48:26+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Forecast-GS predicts task-completed 3D states via Gaussian splatting to achieve higher success rates than baselines in real-world language-conditioned manipulation tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.10307","ref_index":2,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"PaMoSplat: Part-Aware Motion-Guided Gaussian Splatting for Dynamic Scene Reconstruction","primary_cat":"cs.CV","submitted_at":"2026-05-11T10:06:41+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"PaMoSplat reconstructs dynamic scenes by lifting 2D segmentations to coherent 3D Gaussian parts and estimating their motions via optical flow-guided differential evolution for higher quality rendering and faster training.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Index Terms-Gaussian Splatting, Dynamic Scene Reconstruction, Novel View Synthesis I. 
INTRODUCTION 3 D dynamic scene reconstruction plays a critical role in numerous computer vision and robotics applications, par- ticularly in augmented and virtual reality (AR/VR) and real- to-sim transfer. More notably, the recent surge in tracking- any-point [1] or tracking-keypoint [2] policies highlights the pivotal role of dense tracking in advancing robot manipulation skills. The core challenge in dynamic reconstruction lies in accurately modeling both the geometry and appearance of 3D scenes while maintaining persistent tracking of moving elements, enabling high-fidelity novel-view synthesis and tem- porally consistent motion analysis."},{"citing_arxiv_id":"2605.07306","ref_index":7,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"BioProVLA-Agent: An Affordable, Protocol-Driven, Vision-Enhanced VLA-Enabled Embodied Multi-Agent System with Closed-Loop-Capable Reasoning for Biological Laboratory Manipulation","primary_cat":"cs.RO","submitted_at":"2026-05-08T06:15:40+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"However, these methods often rely on relatively reliable depth perception, object segmentation, or three-dimensional reconstruction. In wet-lab scenarios involv- ing transparent labware, reﬂective surfaces, and liquid contain- ers, such perception outputs can become unstable. Meanwhile, Vision-Language-Action (VLA) models and imitation learning methods, such as X-VLA [ 7] and SmolVLA [ 8], have shown promising performance in language-conditioned robotic con- trol, dual-arm manipulation, and cross-embodiment generaliza- tion. Nevertheless, most VLA systems still emphasize direct observation-to-action mapping and lack explicit semantic ver- iﬁcation before and after execution. 
As a result, VLA execution is commonly treated as an instruction-following process,"},{"citing_arxiv_id":"2605.05714","ref_index":34,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"TriRelVLA: Triadic Relational Structure for Generalizable Embodied Manipulation","primary_cat":"cs.CV","submitted_at":"2026-05-07T05:57:49+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"TriRelVLA introduces triadic object-hand-task relational representations and a task-grounded graph transformer with a relational bottleneck to improve generalization in robotic manipulation across scenes, objects, and tasks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Analysis and Machine Intelligence, 2025. [33] Yufeng Huang, Jiji Tang, Zhuo Chen, Rongsheng Zhang, Xinfeng Zhang, Weijie Chen, Zeng Zhao, Zhou Zhao, Tangjie Lv, Zhipeng Hu, et al. Structure-clip: Towards scene graph knowledge to enhance multi-modal structured representations. In Proceedings of the AAAI conference on artificial intelligence, volume 38, pages 2417-2425, 2024. [34] Wenlong Huang, Chen Wang, Yunzhu Li, Ruohan Zhang, and Li Fei-Fei. Rekep: Spatio-temporal reasoning of relational keypoint constraints for robotic manipulation. arXiv preprint arXiv:2409.01652, 2024. [35] Haonan Wang, Hanyu Zhou, Haoyue Liu, and Luxin Yan. 
4d-vggt: A general foundation model with spatiotemporal awareness for dynamic scene geometry estimation."},{"citing_arxiv_id":"2605.01448","ref_index":12,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Decompose and Recompose: Reasoning New Skills from Existing Abilities for Cross-Task Robotic Manipulation","primary_cat":"cs.RO","submitted_at":"2026-05-02T13:55:28+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Decompose and Recompose decomposes seen robotic demonstrations into skill-action alignments and recomposes them via visual-semantic retrieval and planning to enable zero-shot cross-task generalization.","context_count":1,"top_context_role":"background","top_context_polarity":"unclear","context_text":"and clip each element to[0,⌊360 ◦/∆⌋ −1]. Gripper .The gripper command is kept discrete asg∈ {0,1}. A.4. Decoding: discrete LLM tokens→continuous control Given an LLM-predicted discrete actiona= [i,k, g], we recover a continuous control target as follows. Translation reconstruction.We map voxel indices to the center of the corresponding voxel cell: p=b min +r⊙i+r/2,(12) where⊙denotes elementwise multiplication. Rotation reconstruction.We recover Euler angles from bins by θ= ∆·k−180 ◦,(13) then convertθback to a quaternionqusing the same Euler convention as in the encoding step. 
11 Reasoning New Skills from Existing Abilities for Cross-Task Robotic Manipulation Algorithm 1Decompose and Recompose: Skill-Based Cross-Task Manipulation"},{"citing_arxiv_id":"2604.23249","ref_index":10,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"BridgeACT: Bridging Human Demonstrations to Robot Actions via Unified Tool-Target Affordances","primary_cat":"cs.RO","submitted_at":"2026-04-25T11:01:27+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"BridgeACT learns robot manipulation from human videos alone by predicting task-relevant grasp regions and 3D motion affordances that map directly to robot controllers.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.21241","ref_index":22,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"CorridorVLA: Explicit Spatial Constraints for Generative Action Heads via Sparse Anchors","primary_cat":"cs.RO","submitted_at":"2026-04-23T03:17:50+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"CorridorVLA improves VLA models by using predicted sparse anchors to impose explicit spatial corridors on action trajectories, yielding 3.4-12.4% success rate gains on LIBERO-Plus with GR00T-Corr reaching 83.21%.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"corridor that directly guides a generative action head, pro- viding a lightweight and interpretable way to inject spatial objectives into continuous trajectory generation. B. View-Centered Spatial Grounding Several recent VLA works explore camera-centric or ego- centric formulations that build a unified representation space from the agent's first-person view, including OC-VLA [22], EgoVLA [23], and cVLA [24]. 
By treating the camera view as the primary reference frame, these methods aim to align perception with action in a view-consistent manner, which is broadly compatible with our motivation of using grounded representations to connect multimodal inputs and control. At the same time, camera-centered parameterizations inherit practical variability across platforms: camera resolution,"},{"citing_arxiv_id":"2604.08983","ref_index":20,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"AssemLM: A Spatial Reasoning Multimodal Large Language Model for Robotic Assembly","primary_cat":"cs.RO","submitted_at":"2026-04-10T05:43:39+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":1,"top_context_role":"other","top_context_polarity":"unclear","context_text":"Tiger: Tool-integrated geometric reasoning in vision-language models for robotics. arXiv preprint arXiv:2510.07181, 2025. [19] Minho Heo, Youngwoon Lee, Doohyun Lee, and Joseph J Lim. Furniturebench: Reproducible real-world benchmark for long-horizon complex manipulation. The International Journal of Robotics Research, 44(10-11):1863-1891, 2025. [20] Wenlong Huang, Chen Wang, Yunzhu Li, Ruohan Zhang, and Li Fei-Fei. Rekep: Spatio-temporal reasoning of relational keypoint constraints for robotic manipulation. arXiv preprint arXiv:2409.01652, 2024. [21] Adeela Islam, Stefano Fiorini, Manuel Lecha, Theodore Tsesmelis, Stuart James, Pietro Morerio, and Alessio Del Bue. 
E-m3rf: An equivariant multimodal 3d re-"},{"citing_arxiv_id":"2604.07034","ref_index":8,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"KITE: Keyframe-Indexed Tokenized Evidence for VLM-Based Robot Failure Analysis","primary_cat":"cs.RO","submitted_at":"2026-04-08T12:49:24+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"KITE is a training-free method that uses keyframe-indexed tokenized evidence including BEV schematics to enhance VLM performance on robot failure detection, identification, localization, explanation, and correction.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"has increasingly turned to large language models (LLMs) and vision-language models (VLMs) as general-purpose reasoning interfaces. Foundation models offer open-vocabulary percep- tion, natural-language conditioning, reusable commonsense priors, and a single interface that can support planning, monitoring, explanation, and recovery across many tasks and embodiments [1]-[8]. Yet their strengths are blunted when the input is a long raw execution video: subtle failure cues are easily buried in dense visual detail, temporal context is diluted, and the evidence needed for diagnosis is rarely presented in a form that is immediately legible to the model. 
Prior work has begun to address this problem by summariz- ing robot experiences for an LLM [9], training failure-specific"},{"citing_arxiv_id":"2604.05430","ref_index":2,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Synergizing Efficiency and Reliability for Continuous Mobile Manipulation","primary_cat":"cs.RO","submitted_at":"2026-04-07T04:55:16+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"A framework integrates anticipatory planning and real-time feedback via reliability-aware optimization and phase switching to achieve efficient, reliable continuous mobile manipulation under uncertainty.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.04974","ref_index":44,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"From Video to Control: A Survey of Learning Manipulation Interfaces from Temporal Visual Data","primary_cat":"cs.RO","submitted_at":"2026-04-04T15:37:11+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2602.08392","ref_index":55,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"ST-BiBench: Benchmarking Multi-Stream Multimodal Coordination in Bimanual Embodied Tasks for MLLMs","primary_cat":"cs.RO","submitted_at":"2026-02-09T08:47:14+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"ST-BiBench reveals a coordination paradox in which MLLMs show strong high-level strategic reasoning yet fail at fine-grained 16-dimensional bimanual action synthesis and multi-stream fusion.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"maps for robotic 
manipulation with language models.arXiv preprint arXiv:2307.05973, 2023. [54] Wenlong Huang, Fei Xia, Dhruv Shah, Danny Driess, Andy Zeng, Yao Lu, Pete Florence, Igor Mordatch, Sergey Levine, Karol Hausman, et al. Grounded decoding: Guiding text generation with grounded models for embodied agents.Ad- vances in Neural Information Processing Systems, 36:59636- 59661, 2023. [55] Wenlong Huang, Chen Wang, Yunzhu Li, Ruohan Zhang, and Li Fei-Fei. Rekep: Spatio-temporal reasoning of rela- tional keypoint constraints for robotic manipulation.arXiv preprint arXiv:2409.01652, 2024. [56] brian ichter, Anthony Brohan, Yevgen Chebotar, Chelsea Finn, Karol Hausman, Alexander Herzog, Daniel Ho, Julian Ibarz, Alex Irpan, Eric Jang, Ryan Julian, Dmitry Kalash-"},{"citing_arxiv_id":"2601.07060","ref_index":42,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"PALM: Progress-Aware Policy Learning via Affordance Reasoning for Long-Horizon Robotic Manipulation","primary_cat":"cs.RO","submitted_at":"2026-01-11T21:00:58+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"PALM improves long-horizon robotic manipulation success by distilling affordance representations for object interaction and predicting within-subtask progress in a VLA model.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"which lacks explicit reasoning and fine-grained representa- tions of spatial or physical dynamics. Subsequent works address these limitations by incorporating future prediction through goal-image generation [26, 31, 38, 117, 136, 150] or integrated forecasting [ 112, 126, 143, 148, 153], or by enhancing spatio-temporal grounding via keypoint predic- tion [42, 135] and historical visual traces [93]. 
In contrast, our work introduces a closed perception-action-progress loop that improves long-horizon manipulation by integrating affordance and subtask progress reasoning into the VLA. Imitation Learning with Progress Supervision.Early imitation learning approaches for long-horizon tasks relied on explicit task decomposition, such as symbolic planning"},{"citing_arxiv_id":"2512.01773","ref_index":26,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"IGen: Scalable Data Generation for Robot Learning from Open-World Images","primary_cat":"cs.RO","submitted_at":"2025-12-01T15:15:04+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"IGen generates realistic visuomotor training data including actions and temporally coherent visuals from unstructured open-world images via 3D reconstruction and VLM reasoning.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2510.13778","ref_index":14,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"InternVLA-M1: A Spatially Guided Vision-Language-Action Framework for Generalist Robot Policy","primary_cat":"cs.RO","submitted_at":"2025-10-15T17:30:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"InternVLA-M1 uses spatially guided pre-training on 2.3M examples followed by action post-training to deliver up to 17% gains on robot manipulation benchmarks and 20.6% on unseen objects.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2509.19102","ref_index":16,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"FUNCanon: Learning Pose-Aware Action Primitives via Functional Object Canonicalization for Generalizable Robotic 
Manipulation","primary_cat":"cs.RO","submitted_at":"2025-09-23T14:49:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"FunCanon introduces functional object canonicalization with VLM affordances to create pose-aware action primitives for generalizable imitation learning in robotic manipulation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2509.14787","ref_index":5,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"COMPASS: Confined-space Manipulation Planning with Active Sensing Strategy","primary_cat":"cs.RO","submitted_at":"2025-09-18T09:37:24+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"COMPASS is a manipulation-aware active sensing framework that raises simulated manipulation success rates by 24.25% over information-gain-only baselines in a new four-level confined-space benchmark.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2508.13998","ref_index":13,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Embodied-R1: Reinforced Embodied Reasoning for General Robotic Manipulation","primary_cat":"cs.RO","submitted_at":"2025-08-19T16:50:01+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Embodied-R1 uses a pointing-centric representation and reinforced fine-tuning on a 200K dataset to achieve state-of-the-art results on embodied benchmarks plus 56.2% success in SIMPLEREnv and 87.5% on real XArm tasks without task-specific training.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2507.00990","ref_index":49,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Robotic 
Manipulation by Imitating Generated Videos Without Physical Demonstrations","primary_cat":"cs.RO","submitted_at":"2025-07-01T17:39:59+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"RIGVid shows that filtered AI-generated videos can serve as effective supervision for complex robotic manipulation tasks without any real demonstrations.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"The main downside of video generation is its substan- tial computational cost. Also, on a representational level, one may wonder whether predicting video pixels is waste- ful, and whether we should instead predict a more compact and minimal representation that can be efficiently translated to an executable trajectory. One example of this philosophy is the recent ReKep method [49], which uses a VLM to gen- erate relational keypoint constraints from a task description and then solves for a 6D trajectory given these constraints. We compare our approach to ReKep and demonstrate that video generation does, in fact, perform substantially bet- ter than the generation of a more sparse and high-level rep- resentation. 
Next, given a generated video, one may ask"},{"citing_arxiv_id":"2503.22020","ref_index":26,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"CoT-VLA: Visual Chain-of-Thought Reasoning for Vision-Language-Action Models","primary_cat":"cs.CV","submitted_at":"2025-03-27T22:23:04+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"CoT-VLA is a 7B VLA that generates future visual frames autoregressively as planning goals before actions, outperforming prior VLAs by 17% on real-world tasks and 6% in simulation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2503.10631","ref_index":44,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"HybridVLA: Collaborative Diffusion and Autoregression in a Unified Vision-Language-Action Model","primary_cat":"cs.CV","submitted_at":"2025-03-13T17:59:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"HybridVLA unifies diffusion and autoregression in a single VLA model via collaborative training and ensemble to raise robot manipulation success rates by 14% in simulation and 19% in real-world tasks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"language for affordance-guided visual manipulation. arXiv preprint arXiv:2403.08355, 2024. [43] Jiaming Liu, Chenxuan Li, Guanqun Wang, Lily Lee, Kaichen Zhou, Sixiang Chen, Chuyan Xiong, Jiaxin Ge, Renrui Zhang, and Shanghang Zhang. Self-corrected multimodal large language model for end-to-end robot manipulation. arXiv preprint arXiv:2405.17418, 2024. [44] Wenlong Huang, Chen Wang, Yunzhu Li, Ruohan Zhang, and Li Fei-Fei. Rekep: Spatio-temporal reasoning of relational keypoint constraints for robotic manipulation. arXiv preprint arXiv:2409.01652, 2024. 
[45] Jiaming Liu, Mengzhen Liu, Zhenyu Wang, Lily Lee, Kaichen Zhou, Pengju An, Senqiao Yang, Renrui Zhang, Yandong Guo, and Shanghang Zhang. Robomamba: Multimodal state space model for efficient"},{"citing_arxiv_id":"2501.09747","ref_index":32,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"FAST: Efficient Action Tokenization for Vision-Language-Action Models","primary_cat":"cs.RO","submitted_at":"2025-01-16T18:57:04+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"FAST applies discrete cosine transform to robot action sequences for efficient tokenization, enabling autoregressive VLAs to succeed on high-frequency dexterous tasks and scale to 10k hours of data while matching diffusion VLA performance with up to 5x faster training.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Bridging the human to robot dex- terity gap through object-oriented rewards, 2024. URL https://arxiv.org/abs/2410.23289. [31] Huy Ha, Yihuai Gao, Zipeng Fu, Jie Tan, and Shuran Song. UMI on legs: Making manipulation policies mo- bile with manipulation-centric whole-body controllers. In Proceedings of the 2024 Conference on Robot Learning , 2024. [32] Wenlong Huang, Chen Wang, Yunzhu Li, Ruohan Zhang, and Li Fei-Fei. Rekep: Spatio-temporal reasoning of relational keypoint constraints for robotic manipulation. arXiv preprint arXiv:2409.01652 , 2024. [33] David A. Huffman. A method for the construction of minimum-redundancy codes. Proceedings of the IRE , 40(9):1098-1101, 1952. 
doi: 10.1109/JRPROC."},{"citing_arxiv_id":"2405.14093","ref_index":37,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"A Survey on Vision-Language-Action Models for Embodied AI","primary_cat":"cs.RO","submitted_at":"2024-05-23T01:43:54+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"This is the first survey on vision-language-action models, providing a taxonomy across three lines, plus summaries of datasets, simulators, benchmarks, challenges, and future directions in embodied AI.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"Real-Mani(Franka): custom study desk RPT [28] ViT MAE P i∈MMSE (xi, fmae(x/∈M, y, z, . . .))x, y, z: three distinct modalities;M: masked set Real-Mani(Franka): stack, pick, pick from bin DINOv2 [34] ViT Self- distillation P x P x′̸=xH(Pt(x), Ps(x′))x, x ′: image views;H(): cross-entropy; Pt, Ps: teacher, student (Used by OpenVLA [35], ReKep [36]) I-JEPA [37] ViT JEPA P i∈MMSE (f′(xi), g(f(x/∈M)))x i: image block;M: target set;f, f′: encoder, EMA encoder;g: predictor Theia [38] ViT-T/S/B Distillation (Distillation of vision foundation models: ViT, CLIP, SAM, DINOv2, Depth-Anything.) Sim-Mani&Navi: CortexBench (VC-1); Real-Mani: pick, place, open door/drawer MVP [26] applies a masked autoencoder (MAE) from com-"}],"limit":50,"offset":0}