{"total":14,"items":[{"citing_arxiv_id":"2606.32009","ref_index":5,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Human-as-Humanoid: Enabling Zero-Shot Humanoid Learning from Ego-Exo Human Videos with Human-Aligned Embodiments","primary_cat":"cs.RO","submitted_at":"2026-06-30T17:44:16+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Human-as-Humanoid converts ego-exo human videos into executable 60-DoF humanoid actions through embodiment alignment and retargeting, enabling zero-shot real-robot policy deployment without target-task teleoperation data.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.28133","ref_index":11,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Translation as a Bridging Action: Transferring Manipulation Skills from Humans to Robots","primary_cat":"cs.RO","submitted_at":"2026-06-26T14:34:04+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A relative wrist translation bridging action with a vision-language-action model using interleaved tokens and attention masking transfers human manipulation skills to robots more effectively than 6DoF actions.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.06194","ref_index":11,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"ActiveMimic: Egocentric Video Pretraining with Active Perception","primary_cat":"cs.RO","submitted_at":"2026-06-04T14:01:01+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ActiveMimic pretrains on egocentric human video by recovering and modeling active camera motion as viewpoint actions, matching robot-data pretraining performance on real-world 
tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.03177","ref_index":4,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"ConTrack: Constrained Hand Motion Tracking with Adaptive Trade-off Control","primary_cat":"cs.RO","submitted_at":"2026-06-02T05:31:32+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"ConTrack introduces a constrained RL method with online dual-variable adaptation and adaptive resets for improved long-horizon hand tracking in simulation and on real robots.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.00054","ref_index":16,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"From Human Videos to Robot Manipulation: A Survey on Scalable Vision-Language-Action Learning with Human-Centric Data","primary_cat":"cs.RO","submitted_at":"2026-05-18T06:19:16+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"The paper surveys four classes of techniques that derive action-related supervision from human videos for VLA robot models and identifies three open challenges in episode structuring, embodiment grounding, and evaluation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.16743","ref_index":24,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"LACE: Latent Visual Representation for Cross-Embodiment Learning","primary_cat":"cs.RO","submitted_at":"2026-05-16T01:50:18+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LACE aligns human-robot visual features via semantic distribution matching on corresponding body parts plus Gram loss, yielding 65% better 
zero-shot policy transfer than baseline DINO.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.16412","ref_index":41,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"SCAR: Self-Supervised Continuous Action Representation Learning","primary_cat":"cs.RO","submitted_at":"2026-05-13T16:23:11+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SCAR proposes a joint inverse-forward dynamics framework to learn transferable continuous action representations across embodiments from visual data using regularization and adversarial invariance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.00080","ref_index":8,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"World Model for Robot Learning: A Comprehensive Survey","primary_cat":"cs.RO","submitted_at":"2026-04-30T14:35:31+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"A comprehensive survey that organizes the literature on world models in robot learning, their roles in policy learning, planning, simulation, and video-based generation, with connections to navigation, driving, datasets, and benchmarks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Then several seemingly distinct paradigms can be viewed as different marginals or conditionals of the same underlying predictive-control model: Policy Model: $p(a_{t+1:t+k} \\mid o_t, l) = \\int p(o_{t+1:t+k}, a_{t+1:t+k} \\mid o_t, l)\\, \\mathrm{d}o$ (5); Passive World Model: $p(o_{t+1:t+k} \\mid o_t, l) = \\int p(o_{t+1:t+k}, a_{t+1:t+k} \\mid o_t, l)\\, \\mathrm{d}a$ (6); Controllable World Model: $p(o_{t+1:t+k} \\mid o_t, a_{t+1:t+k})$ (7); Inverse Dynamics Model: $p(a_{t+1:t+k} \\mid o_{t:t+k})$ (8). In this sense, policy model, passive world model (video generation model), controllable world 
model and inverse dynamics model are not entirely separate abstractions; rather, they correspond to different ways of querying or factorizing the same idealized joint distribution. This also explains why world models and policies can be naturally coupled: a policy may use future observations generated by a world model as an intermediate"},{"citing_arxiv_id":"2604.22615","ref_index":12,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"GazeVLA: Learning Human Intention for Robotic Manipulation","primary_cat":"cs.RO","submitted_at":"2026-04-24T14:46:03+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"GazeVLA pretrains on large human egocentric datasets to capture gaze-based intention, then finetunes on limited robot data with chain-of-thought reasoning to achieve better robotic manipulation performance than baselines.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Building upon the success of Vision-Language Models (VLM) [1, 42, 57, 87], Vision-Language-Action (VLA) [8,36,61] models have made significant strides in embodied intelligence. Current VLA research predominantly employs autoregressive tokenization strategies [9,36,89] and continuous action spaces formulated through flow-based generative paradigms [8,12,21,32,35,69]. To further enhance reasoning capabilities, recent works have integrated Chain-of-Thought (CoT) into VLA models, utilizing task decomposition [32,35,74,75] or intermediate signal prediction [21,75,79,85] to bolster logical consistency. 
However, the performance of these models remains heavily contingent on large-scale, high-quality real-robot data [51,62]."},{"citing_arxiv_id":"2604.19734","ref_index":1,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"UniT: Toward a Unified Physical Language for Human-to-Humanoid Policy Learning and World Modeling","primary_cat":"cs.RO","submitted_at":"2026-04-21T17:57:27+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"UniT creates a unified physical language via visual anchoring and tri-branch reconstruction to enable scalable human-to-humanoid transfer for policy learning and world modeling.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Scaling foundation models for humanoids in both policy learning and world modeling is fundamentally bottlenecked by scarce high-quality robotic data. Massive, structured human motion sequences from low-cost capture provide a scalable alternative rich in physical interaction priors, but leveraging them requires bridging a major cross-embodiment gap [1]. Biomechanical and hardware differences create heterogeneous state-action spaces with mismatched degrees of freedom (DoF) and control paradigms. Traditional pipelines rely on motion retargeting [2, 3], which uses complex kinematic solvers to map human motions to specific robots. 
This case-by-case process is labor-intensive, unscalable, and often physically inconsistent."},{"citing_arxiv_id":"2604.13645","ref_index":3,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"A Mechanistic Analysis of Sim-and-Real Co-Training in Generative Robot Policies","primary_cat":"cs.RO","submitted_at":"2026-04-15T09:14:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Sim-and-real co-training for robot policies is driven primarily by balanced cross-domain representation alignment and secondarily by domain-dependent action reweighting.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"MugHang Decoder-only U-net Figure 14. Balanced mixing ratios are robust to different policy architectures. Each data point is computed over three policy checkpoints, and each checkpoint is evaluated for 200 trials. The best performance is consistently achieved in the range of (0.016, 0.3). that the best performance consistently lies in a narrow range ([0.016, 0.3]), indicating robustness beyond the architecture used in the main paper. D.4. Comparison to \"Simulation Pre-training+Real Fine-tune\" Although some prior work shows that it is generally less effective than co-training, as this still remains as an important baseline, we include a direct comparison here. 
We pre-train with simulation data only for ∼130k steps and fine-tune with"},{"citing_arxiv_id":"2604.07607","ref_index":7,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"EgoVerse: An Egocentric Human Dataset for Robot Learning from Around the World","primary_cat":"cs.RO","submitted_at":"2026-04-08T21:27:06+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"EgoVerse releases 1,362 hours of standardized egocentric human data across 1,965 tasks and shows via multi-lab experiments that robot policy performance scales with human data volume when the data aligns with robot objectives.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"of actions via inverse dynamics models [6, 13, 55], affordances [2, 44], or point tracking [4, 42, 50] for policy training, forming a basis for some foundation models [8, 37, 38, 54], yet often still necessitating in-domain robot data. Alternatively, labeled human demonstrations can be co-trained with robot data as distinct embodiments for policy learning [22, 29, 32, 35, 41, 59], post-training [7, 30], and world modeling [18, 24]. These works found that this practice enhances robustness and scene understanding. However, such findings remain confined to limited scale and single robot embodiment, leaving critical questions about multiple robot embodiments and varied human data sources largely unexplored. 
Our work addresses these fundamental gaps through a large-scale human dataset and a"},{"citing_arxiv_id":"2603.03243","ref_index":3,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"HoMMI: Learning Whole-Body Mobile Manipulation from Human Demonstrations","primary_cat":"cs.RO","submitted_at":"2026-03-03T18:36:49+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"HoMMI learns whole-body mobile manipulation policies from robot-free human demonstrations by augmenting UMI with egocentric sensing and bridging the embodiment gap through an agnostic visual representation, relaxed head actions, and a whole-body controller.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2511.18127","ref_index":5,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"SFHand: Learning Embodied Manipulation by Streaming Egocentric 3D Hand Forecasting","primary_cat":"cs.CV","submitted_at":"2025-11-22T17:22:24+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"SFHand presents the first streaming language-guided autoregressive framework for 3D hand forecasting, achieving up to 35.8% gains over prior methods and 13.4% better downstream embodied task performance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}