{"total":19,"items":[{"citing_arxiv_id":"2605.22183","ref_index":9,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Action with Visual Primitives","primary_cat":"cs.RO","submitted_at":"2026-05-21T08:52:47+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"AVP architecture has VLM emit visual-primitive tokens to condition flow-matching action expert, yielding 27.61% higher success rate than pi_0.5 on real-robot pick-and-place tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.20774","ref_index":11,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"VLA-REPLICA: A Low-Cost, Reproducible Benchmark for Real-World Evaluation of Vision-Language-Action Models","primary_cat":"cs.RO","submitted_at":"2026-05-20T06:15:30+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"VLA-REPLICA is a low-cost and reproducible real-world benchmark for evaluating VLA models in robotic manipulation tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18617","ref_index":4,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"ManiSoft: Towards Vision-Language Manipulation for Soft Continuum Robotics","primary_cat":"cs.RO","submitted_at":"2026-05-18T16:26:22+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"ManiSoft is a new benchmark featuring a soft-body simulator, four deformable control tasks, and an automated pipeline generating 6300 scenes with expert trajectories for training and evaluating vision-language policies on continuum 
robots.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.15492","ref_index":11,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"FLASH: Efficient Visuomotor Policy via Sparse Sampling","primary_cat":"cs.RO","submitted_at":"2026-05-15T00:15:24+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"FLASH Policy uses sparse Legendre polynomial trajectory fitting and history-anchored flow matching to enable single-step inference for visuomotor control, reporting 31.4 ms per-episode latency and >=92% success on five simulated plus two real manipulation tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12090","ref_index":236,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"World Action Models: The Next Frontier in Embodied AI","primary_cat":"cs.RO","submitted_at":"2026-05-12T13:10:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"The paper introduces World Action Models as a new paradigm unifying predictive world modeling with action generation in embodied foundation models and provides a taxonomy of existing approaches.","context_count":1,"top_context_role":"dataset","top_context_polarity":"background","context_text":"COLOSSEUM [219], AGNOSTOS [220], RoboEval [221], RoboVerse [222], PolaRiS [223] RoboMME [224], GenManip [225], VLABench [226], RoboSuite [227], RoboLab [228] SimplerEnv [229], ARNOLD [230], GemBench [231] Bimanual and Humanoid Form RoboTwin [153], BiGym [232], HumanoidBench [233] HumanoidGen [234] Mobile Manipulation ManipulaTHOR [235], HomeRobot [236], BEHAVIOR-1K [237] Contact and Deformation Manipulation SoftGym [238], PlasticineLab [239], DaXBench [240] TacSL [241], ManiFeel [242] Real-Device RoboArena [243], 
RoboChallenge [244], Maniparena [245] Figure 2 The comprehensive roadmap and taxonomy of World Action Models (WAMs) reviewed in this survey. The literature is systematically categorized into four core dimensions: background (Sec."},{"citing_arxiv_id":"2604.22363","ref_index":8,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"LeHome: A Simulation Environment for Deformable Object Manipulation in Household Scenarios","primary_cat":"cs.RO","submitted_at":"2026-04-24T08:53:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LeHome is a simulation platform offering high-fidelity dynamics for robotic manipulation of varied deformable objects in household settings, with support for multiple robot embodiments including low-cost hardware.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.16886","ref_index":4,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Chain Of Interaction Benchmark (COIN): When Reasoning meets Embodied Interaction","primary_cat":"cs.RO","submitted_at":"2026-04-18T07:26:24+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"COIN provides 50 interactive robotic tasks, a 1000-demonstration dataset collected via AR teleoperation, and metrics showing that CodeAsPolicy, VLA, and H-VLA models fail at causally-dependent interactive reasoning due to gaps between vision and execution.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.15805","ref_index":13,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"From Seeing to Simulating: Generative High-Fidelity Simulation with Digital Cousins for Generalizable Robot Learning and 
Evaluation","primary_cat":"cs.RO","submitted_at":"2026-04-17T08:06:26+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Digital Cousins is a generative real-to-sim method that creates diverse high-fidelity simulation scenes from real panoramas to improve generalization in robot learning and evaluation.","context_count":1,"top_context_role":"background","top_context_polarity":"unclear","context_text":"[11] Matt Deitke, Rose Hendrix, Luca Weihs, Ali Farhadi, Kiana Ehsani, and Aniruddha Kembhavi. Phone2proc: Bringing robust robots into our chaotic world, 2022. URL https://arxiv.org/abs/2212.04819. [12] Daniel DeTone, Tomasz Malisiewicz, and Andrew Rabinovich. Superpoint: Self-supervised interest point detection and description, 2018. URL https://arxiv.org/abs/1712.07629. [13] Haoran Geng, Feishi Wang, Songlin Wei, Yuyang Li, Bangjun Wang, Boshi An, Charlie Tianyue Cheng, Haozhe Lou, Peihao Li, Yen-Jen Wang, et al. Roboverse: Towards a unified platform, dataset and benchmark for scalable and generalizable robot learning. arXiv preprint arXiv:2504.18904, 2025. 
[14] Xiangyu Guo, Zhanqian Wu, Kaixin Xiong, Ziyang Xu, Lijun Zhou, Gangwei Xu, Shaoqing Xu, Haiyang Sun,"},{"citing_arxiv_id":"2604.05831","ref_index":17,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"BiCoord: A Bimanual Manipulation Benchmark towards Long-Horizon Spatial-Temporal Coordination","primary_cat":"cs.RO","submitted_at":"2026-04-07T13:02:17+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"BiCoord is a new benchmark for long-horizon tightly coordinated bimanual manipulation that includes quantitative metrics and shows existing policies like DP, RDT, Pi0 and OpenVLA-OFT struggle on such tasks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"spatially and temporally entangled bimanual interactions are crucial for achieving human-level dexterity, motivating the design of bimanual manipulation that can mirror this intrinsic coordination. Recently, the community recognizes the importance of bimanual manipulation, and simulation-based benchmarks are developed, such as RoboTwin [7, 43] and RLBench2 [17]. These benchmarks provide bimanual manipulation tasks and expert demonstration data for facilitating data-driven learning. 
However, existing benchmarks still fall short in two aspects for capturing the full complexity of real-world coordination. (1) Short-horizon tasks: Existing benchmarks only focus on short tasks that can be completed within a few motion primitives (e."},{"citing_arxiv_id":"2604.16331","ref_index":9,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"BrainMem: Brain-Inspired Evolving Memory for Embodied Agent Task Planning","primary_cat":"cs.RO","submitted_at":"2026-03-12T17:54:42+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"BrainMem equips LLM-based embodied planners with working, episodic, and semantic memory that evolves interaction histories into retrievable knowledge graphs and guidelines, raising success rates on long-horizon 3D benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2602.09023","ref_index":12,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"TwinRL: Digital Twin-Driven Reinforcement Learning for Real-World Robotic Manipulation","primary_cat":"cs.RO","submitted_at":"2026-02-09T18:59:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"TwinRL expands RL exploration via digital twin reconstruction and twin RL warm-up to guide real-world learning, reaching near-100% success with 20 minutes of on-robot time across four tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2602.07322","ref_index":7,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Action-to-Action Flow Matching","primary_cat":"cs.RO","submitted_at":"2026-02-07T02:39:49+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"A2A flow matching starts action 
generation from prior proprioceptive actions in latent space to enable single-step high-quality predictions in robotic policies.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2601.02078","ref_index":15,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Genie Sim 3.0 : A High-Fidelity Comprehensive Simulation Platform for Humanoid Robot","primary_cat":"cs.RO","submitted_at":"2026-01-05T12:59:39+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Genie Sim 3.0 introduces an LLM-powered scene generator, the first LLM-based automated evaluation benchmark, and a large open synthetic dataset that demonstrates zero-shot sim-to-real transfer for robotic manipulation policies.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2512.15840","ref_index":32,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Large Video Planner Enables Generalizable Robot Control","primary_cat":"cs.RO","submitted_at":"2025-12-17T18:35:54+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"A video foundation model trained on human demonstrations generates zero-shot plans that convert to executable robot actions on novel scenes and tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2509.09674","ref_index":36,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"SimpleVLA-RL: Scaling VLA Training via Reinforcement Learning","primary_cat":"cs.RO","submitted_at":"2025-09-11T17:59:17+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SimpleVLA-RL applies tailored reinforcement learning to VLA models, reaching SoTA on LIBERO, 
outperforming π₀ on RoboTwin, and surpassing SFT in real-world tasks while reducing data needs and identifying a 'pushcut' phenomenon.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2507.12768","ref_index":15,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"AnyPos: Automated Task-Agnostic Actions for Bimanual Manipulation","primary_cat":"cs.CV","submitted_at":"2025-07-17T03:48:57+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"AnyPos automates task-agnostic action collection and inverse-dynamics modeling with arm/end-effector decoupling plus a direction-aware decoder, delivering 51% higher test accuracy and 30-40% better success rates on bimanual tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2506.18088","ref_index":15,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"RoboTwin 2.0: A Scalable Data Generator and Benchmark with Strong Domain Randomization for Robust Bimanual Robotic Manipulation","primary_cat":"cs.RO","submitted_at":"2025-06-22T16:26:53+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"RoboTwin 2.0 automates diverse synthetic data creation for dual-arm robots via MLLMs and five-axis domain randomization, leading to 228-367% gains in manipulation success.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"1 Datasets and Benchmarks for Robotic Manipulation Physics-based simulators underpin modern manipulation research. 
Existing platforms provide complementary strengths: SAPIEN [48] enables dynamic interaction with 2,300+ articulated objects; ManiSkill2 [16] supplies millions of demonstrations; Meta-World [50], CALVIN [32], LIBERO [30], and RoboVerse [15] target multi-task, language-conditioned, lifelong, and domain-randomized settings; RoboCasa [35] offers large-scale human demonstrations but lacks automation and dual-arm focus. 11 Large-scale real-world datasets further bridge sim-to-real: AgiBot World [4], RoboMIND [47], Open X-Embodiment [36], and Bridge [12] contribute millions of trajectories across diverse tasks, robots,"},{"citing_arxiv_id":"2506.15953","ref_index":12,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"ViTacFormer: Learning Cross-Modal Representation for Visuo-Tactile Dexterous Manipulation","primary_cat":"cs.RO","submitted_at":"2025-06-19T01:38:13+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ViTacFormer learns a cross-modal visuo-tactile latent space with autoregressive tactile prediction and an easy-to-hard curriculum, then uses the representation for imitation learning that yields ~50% higher success and the first reported 11-stage, 2.5-minute autonomous dexterous tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2506.02618","ref_index":20,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Rodrigues Network for Learning Robot Actions","primary_cat":"cs.RO","submitted_at":"2025-06-03T08:34:06+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Proposes Rodrigues Network using a learnable Neural Rodrigues Operator to add kinematic inductive biases for improved robot action learning and 
prediction.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}