{"total":20,"items":[{"citing_arxiv_id":"2607.01067","ref_index":55,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Human-Centric Transferable Tactile Pre-Training for Dexterous Robotic Manipulation","primary_cat":"cs.RO","submitted_at":"2026-07-01T15:26:26+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Introduces H-Tac human tactile-action dataset and TTP pre-training that unifies spaces and predicts future tactile signals to improve robotic dexterous manipulation transfer.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2607.00678","ref_index":64,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"ABot-M0.5: Unified Mobility-and-Manipulation World Action Model","primary_cat":"cs.CV","submitted_at":"2026-07-01T09:21:20+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ABot-M0.5 proposes a unified mobility-and-manipulation world action model using three alignment strategies that achieves state-of-the-art performance on mobile and fine-grained manipulation benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.28276","ref_index":77,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"SimFoundry: Modular and Automated Scene Generation for Policy Learning and Evaluation","primary_cat":"cs.RO","submitted_at":"2026-06-26T17:18:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SimFoundry automates zero-shot real-to-sim scene generation from video, producing digital twins and cousins that enable policy training with 0.911 mean Pearson correlation to real-world results and 17-40% success gains from 
variations.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.25939","ref_index":61,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"DeformGen: Dynamics-Based Topology Augmentation for Deformable Manipulation Policy Learning","primary_cat":"cs.RO","submitted_at":"2026-06-24T15:15:38+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"DeformGen uses dynamics-based state expansion via localized disturbances and deformation-field warping for trajectory transfer to improve policy learning on deformable manipulation benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.13674","ref_index":27,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"RepWAM: World Action Modeling with Representation Visual-Action Tokenizers","primary_cat":"cs.CV","submitted_at":"2026-06-11T17:59:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"RepWAM introduces representation visual-action tokenizers to pretrain world action models that jointly model future visual states and latent actions under instructions for improved robot manipulation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.13578","ref_index":33,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"LabVLA: Grounding Vision-Language-Action Models in Scientific Laboratories","primary_cat":"cs.CL","submitted_at":"2026-06-11T17:03:53+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"LabVLA uses RoboGenesis simulation data and a two-stage FAST pretraining plus flow matching recipe on a Qwen3-VL backbone to achieve the highest success 
rates on LabUtopia under in- and out-of-distribution conditions.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.10366","ref_index":40,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"A Practical Recipe Towards Improving Sim-and-Real Correlation for VLA Evaluation","primary_cat":"cs.RO","submitted_at":"2026-06-09T03:25:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Authors perform a cross-simulator, cross-policy empirical study of sim-to-real correlation for VLA policies and distill guidance on using simulation for policy improvement.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.06155","ref_index":64,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"AffordanceVLA: A Vision-Language-Action Model Empowering Action Generation through Affordance-Aware Understanding","primary_cat":"cs.RO","submitted_at":"2026-06-04T13:28:51+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"AffordanceVLA proposes a VLA model with affordance-aware modules (Which2Act, Where2Act, How2Act) in a Mixture-of-Transformer trained in three stages to improve robotic manipulation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.04463","ref_index":32,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"OSCAR: Omni-Embodiment Action-Conditioned World Model for Robotics","primary_cat":"cs.RO","submitted_at":"2026-06-03T05:16:41+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"OSCAR finetunes Cosmos-Predict2.5-2B on a deduplicated multi-embodiment robotics dataset with kinematic skeleton 
conditioning, claiming better action following and significant correlation between virtual and real robot policy evaluations.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.27724","ref_index":52,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"HumanoidMimicGen: Data Generation for Loco-Manipulation via Whole-Body Planning","primary_cat":"cs.RO","submitted_at":"2026-05-26T21:57:11+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"HumanoidMimicGen automatically generates large loco-manipulation datasets from few source demonstrations using whole-body planning, enabling visuomotor policies that outperform real-data-only training by 20% on a new nine-task benchmark.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.26638","ref_index":14,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"HyperSim: A Holistic Sim-To-Real Framework For Robust Robotic Manipulation","primary_cat":"cs.RO","submitted_at":"2026-05-26T07:19:04+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"HyperSim reports 80% and 95% sim-to-real success on two manipulation policies across 400 real executions by combining synthetic environment synthesis, adversarial trajectories, and co-training.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.21372","ref_index":42,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Closed Loop Dynamic Driving Data Mixture for Real-Synthetic Co-Training","primary_cat":"cs.CV","submitted_at":"2026-05-20T16:36:26+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"AutoScale is a 
closed-loop data engine using Graph-RAE for scene representation and Cluster-GA for importance-based retrieval to improve real-synthetic co-training for autonomous driving.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.16137","ref_index":12,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"STABLE: Simulation-Ready Tabletop Layout Generation via a Semantics-Physics Dual System","primary_cat":"cs.CV","submitted_at":"2026-05-15T16:18:42+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"STABLE generates simulation-ready tabletop scenes by alternating a semantic LLM reasoner for task-aligned coarse layouts with a physics corrector for physical plausibility using progressive scene expansion.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.01799","ref_index":45,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Embody4D: A Generalist Data Engine for Embodied 4D World Modeling","primary_cat":"cs.CV","submitted_at":"2026-05-03T09:39:44+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Embody4D generates novel-view videos from monocular robot videos via a 3D-aware synthesis pipeline, confidence-aware expert modulation, and interaction-aware attention for embodied 4D world modeling.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"To train this model, we initially utilize a curated dataset of 23K 4D samples synthesized via foreground-background composition, enabling the model to learn diverse robotic arm morphologies and enhancing its 4D consistency. 
Subsequently, we leverage 24K monocular embodied data uniformly sampled from five datasets (AGIBOT [7], Rh20t [15], Robset [6], Bc-z [23], and Interndata-A1 [45]) to learn real-world robotic arm interactive operations. We evaluate our method on a test set of 120 monocular videos, consisting of 70 samples synthesized via our composition-based approach to include diverse robotic arms, and 50 real-world samples curated from various open-source datasets. Each video contains 49 frames with a resolution of 384×672."},{"citing_arxiv_id":"2604.26694","ref_index":74,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Unified 4D World Action Modeling from Video Priors with Asynchronous Denoising","primary_cat":"cs.RO","submitted_at":"2026-04-29T14:01:54+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"X-WAM unifies robotic action execution and 4D world synthesis by adapting video diffusion priors with a lightweight depth branch and asynchronous noise sampling, achieving 79-91% success on robot benchmarks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"scalable and intelligent embodied systems. arXiv preprint arXiv:2503.06669, 2025. [73] Alexander Khazatsky, Karl Pertsch, Suraj Nair, Ashwin Balakrishna, Sudeep Dasari, Siddharth Karamcheti, Soroush Nasiriany, Mohan Kumar Srirama, Lawrence Yunliang Chen, Kirsty Ellis, et al. Droid: A large-scale in-the-wild robot manipulation dataset. arXiv preprint arXiv:2403.12945, 2024. [74] Yang Tian, Yuyin Yang, Yiman Xie, Zetao Cai, Xu Shi, Ning Gao, Hangxu Liu, Xuekun Jiang, Zherui Qiu, Feng Yuan, et al. Interndata-a1: Pioneering high-fidelity synthetic data for pre-training generalist policy. arXiv preprint arXiv:2511.16651, 2025. 
[75] Ajay Mandlekar, Soroush Nasiriany, Bowen Wen, Iretiayo Akinola, Yashraj Narang, Linxi Fan, Yuke Zhu, and Dieter Fox."},{"citing_arxiv_id":"2604.20100","ref_index":33,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"JoyAI-RA 0.1: A Foundation Model for Robotic Autonomy","primary_cat":"cs.RO","submitted_at":"2026-04-22T01:51:48+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"JoyAI-RA is a multi-source pretrained VLA model that claims to bridge human-to-robot embodiment gaps via data unification and outperforms prior methods on generalization-heavy robotic tasks.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"EgoDex [18] further demonstrates that large-scale egocentric human videos can provide transferable priors for manipulation, especially when large-scale real-robot teleoperation data collection is costly. Simulation data is commonly used to complement these data with scalable and controllable action supervision [11, 24, 33, 40]. For example, InternData-A1 [33] provides large-scale synthetic robot demonstrations across multiple embodiments for policy pretraining. Building on these developments, JoyAI-RA propose our in-house human egocentric dataset EgoLive and our self-collected real-robot data JDAgibot. 
We explicitly structures multi-source data integration to exploit the complementary roles of real-robot, human, and simulation data for generalizable manipulation learning."},{"citing_arxiv_id":"2604.08544","ref_index":45,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"SIM1: Physics-Aligned Simulator as Zero-Shot Data Scaler in Deformable Worlds","primary_cat":"cs.RO","submitted_at":"2026-04-09T17:59:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SIM1 converts sparse real demonstrations into high-fidelity synthetic data through physics-aligned simulation, yielding policies that match real-data performance at a 1:15 ratio with 90% zero-shot success on deformable manipulation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.05484","ref_index":55,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"CoEnv: Driving Embodied Multi-Agent Collaboration via Compositional Environment","primary_cat":"cs.RO","submitted_at":"2026-04-07T06:24:41+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"CoEnv introduces a compositional environment that integrates real and simulated spaces for multi-agent robotic collaboration, using real-to-sim reconstruction, VLM action synthesis, and validated sim-to-real transfer to achieve high success rates on multi-arm manipulation tasks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"mechanisms to integrate complementary skills or to ensure geometric consistency across overlapping perspectives. Unlike approaches that generate policies directly for isolated execution, we leverage the reasoning capabilities of vision-language models to synthesize and refine multi-agent actions collaboratively. 
2.3 Simulation-based Robot Learning Simulation platforms, such as Isaac Sim [40] and MuJoCo [55], constitute essential infrastructure for robotic learning, facilitating the safe exploration of complex control strategies without the risk of physical damage to hardware [41,43]. Conventional methodologies typically involve training control policies exclusively within virtual settings, subsequently utilizing techniques such as domain randomization or system identification to bridge the sim-to-real gap during"},{"citing_arxiv_id":"2603.15956","ref_index":46,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"ExpertGen: Scalable Sim-to-Real Expert Policy Learning from Imperfect Behavior Priors","primary_cat":"cs.RO","submitted_at":"2026-03-16T22:12:48+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ExpertGen generates high-success expert policies in simulation from imperfect priors by freezing a diffusion behavior model and optimizing its initial noise via RL, then distills them for real-robot deployment.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2601.21998","ref_index":74,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Causal World Modeling for Robot Control","primary_cat":"cs.CV","submitted_at":"2026-01-29T17:07:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"LingBot-VA combines video world modeling with policy learning via Mixture-of-Transformers, closed-loop rollouts, and asynchronous inference to improve robot manipulation in simulation and real settings.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"systems is: 7 EEF + 7 joints + 1 gripper per arm, resulting in (7 + 7 + 1)×2 = 30 dimensions. 
Training Data Composition. We aggregate data from six sources spanning diverse embodiments, environments, and task categories: • Agibot [2]: Large-scale dataset with diverse manipulation tasks from mobile manipulators. • RoboMind [81]: Multi-embodiment manipulation demonstrations. • InternData-A1 [74]: Large-scale simulation dataset for sim-to-real transfer. • OXE [53]: Multi-embodiment dataset; we use the OpenVLA subset. • UMI Data [18, 45, 48, 51, 60, 92]: Human demonstration dataset collected via universal manipulation interface 1, excluding DexUMI. • RoboCOIN [84]: Cross-embodiment bimanual robotics data. In total, our training corpus comprises approximately 16K hours of robot manipulation data across diverse tasks and"}],"limit":50,"offset":0}