AnnotateAnything: Automatic Annotation of 3D Assets for Robot Manipulation

Guo Ye; Han Liu; Haoran Lu; Jianshu Zhang; Jiayi Wang; Mutian Shen; Shang Wu; Shuyang Yu; Songling Liu; Yue Chen

arxiv: 2606.17446 · v1 · pith:C7JA4EIJnew · submitted 2026-06-16 · 💻 cs.RO · cs.CV

AnnotateAnything: Automatic Annotation of 3D Assets for Robot Manipulation

Haoran Lu , Mutian Shen , Shuyang Yu , Yu Xiao , Songling Liu , Jianshu Zhang , Shang Wu , Yue Chen

show 4 more authors

Guo Ye Jiayi Wang Zhaoran Wang Han Liu

This is my paper

Pith reviewed 2026-06-27 01:13 UTC · model grok-4.3

classification 💻 cs.RO cs.CV

keywords automatic annotation3D assetsrobot manipulationvision-language reasoningphysics simulationgrasp generationaffordance detectionsimulation data collection

0 comments

The pith

AnnotateAnything converts raw 3D assets into manipulation-ready objects by pairing vision-language reasoning with physics-based action generation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces AnnotateAnything, a framework that adds semantic, interactive, and physical labels to passive 3D models so robots can use them directly in simulation and planning. A visual-language pipeline first infers object meanings, interaction regions, and constraints from geometry and appearance. A separate physics pipeline then generates concrete executable actions such as grasp poses, dexterous contacts, articulation paths, and insertion directions by optimizing candidates against each asset's geometry and dynamics. If the method works, robot data collection could scale without repeated manual labeling of new assets. Experiments in the paper report higher annotation speed, faster parallel data collection across embodiments, and improved task success rates relative to prior annotation pipelines, while the resulting labels also feed into affordance detection and visual question answering.

Core claim

AnnotateAnything turns passive 3D assets into manipulation-ready assets with structured, diverse, and executable manipulation labels. It does so via a unified visual-language annotation pipeline that uses vision-language reasoning to infer object semantics, interaction constraints, and 3D-grounded cues, followed by a fully automatic and massively parallel physics annotation pipeline that grounds these priors through candidate generation, geometry optimization, and trajectory generation to produce grasp poses, dexterous contacts, articulation waypoints, insertion directions, hanging affordances, and navigation targets.

What carries the argument

Two complementary pipelines: a visual-language reasoning stage that supplies human-prior guidance on interaction regions and a physics stage that performs candidate generation, geometry optimization, and trajectory generation to produce executable labels.

If this is right

Annotation efficiency exceeds that of existing manual and automatic pipelines.
Data-collection efficiency increases through asynchronous parallel simulation across objects, tasks, and robot embodiments.
Task success rates rise on manipulation benchmarks relative to prior data-generation methods.
Generated labels directly support downstream applications including affordance detection, robotic visual question answering, and visual instruction finetuning.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Public repositories of 3D models could become immediate sources of robot training data once the annotation step is removed from the critical path.
The same pipelines might be applied to scanned real-world objects if the physics stage is extended to tolerate sensor noise.
Task instructions for visual finetuning could be derived automatically from the same semantic and constraint outputs without separate human authoring.

Load-bearing premise

The vision-language reasoning pipeline produces accurate human-prior guidance on interaction regions and constraints that the physics pipeline can ground into executable actions without systematic errors or the need for manual fixes.

What would settle it

Measure task success rates of robot policies trained on annotations produced by AnnotateAnything versus policies trained on the same assets with manual or alternative automatic annotations; rates below the manual baseline on held-out tasks would falsify the central efficiency and quality claim.

Figures

Figures reproduced from arXiv: 2606.17446 by Guo Ye, Han Liu, Haoran Lu, Jianshu Zhang, Jiayi Wang, Mutian Shen, Shang Wu, Shuyang Yu, Songling Liu, Yue Chen, Yu Xiao, Zhaoran Wang.

**Figure 2.** Figure 2: Hierarchical visual-language annotation pipeline. AnnotateAnything generates asset-level [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗

**Figure 3.** Figure 3: Physics-based action annotation pipeline. Grounded visual-language priors are converted [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗

**Figure 4.** Figure 4: Representative atomic skills: covering tabletop, bimanual, whole-body, dexterous, and [PITH_FULL_IMAGE:figures/full_fig_p006_4.png] view at source ↗

**Figure 5.** Figure 5: Downstream applications supported by AnnotateAnything. The same visual-language [PITH_FULL_IMAGE:figures/full_fig_p007_5.png] view at source ↗

**Figure 6.** Figure 6: Annotation coverage over heterogeneous assets. Condensed coverage statistics: (A) processed asset pool grouped by asset type and source family, (B) physics validation outcomes grouped by skill family, and (C) candidate-to-annotation conversion, showing how generated candidates are filtered into feasible, physics-validated, and retained action annotations. 5.3 Visual-Language Annotation Quality The visual-… view at source ↗

**Figure 7.** Figure 7: Atomic skill taxonomy. Qualitative example of annotation and atomic skill 5.4 Physics-grounded Action Annotation Quality The physics stage converts grounded visual-language priors into executable action banks. For each valid asset–skill pair, it generates candidates, applies geometry and embodiment constraints, optimizes poses or trajectories, and validates the results in simulation [PITH_FULL_IMAGE:figur… view at source ↗

**Figure 8.** Figure 8: Qualitative gallery of asset-level annotations across diverse object categories. Each row [PITH_FULL_IMAGE:figures/full_fig_p019_8.png] view at source ↗

**Figure 9.** Figure 9: Qualitative examples of room-level visual annotations. From left to right: full 3D room [PITH_FULL_IMAGE:figures/full_fig_p020_9.png] view at source ↗

**Figure 10.** Figure 10: Qualitative examples of action annotations across manipulation skills. Each row shows [PITH_FULL_IMAGE:figures/full_fig_p021_10.png] view at source ↗

**Figure 11.** Figure 11: Additional qualitative examples of action annotations for articulated and contact-rich [PITH_FULL_IMAGE:figures/full_fig_p022_11.png] view at source ↗

**Figure 12.** Figure 12: Supplementary illustration of asset-level language annotation. We automatically normalize [PITH_FULL_IMAGE:figures/full_fig_p024_12.png] view at source ↗

**Figure 13.** Figure 13: Common failure cases observed during physics-based validation of automatically generated [PITH_FULL_IMAGE:figures/full_fig_p028_13.png] view at source ↗

**Figure 14.** Figure 14: Composition of a room-scale scene used in room-level visual annotation. The center [PITH_FULL_IMAGE:figures/full_fig_p029_14.png] view at source ↗

**Figure 15.** Figure 15: Closed-loop rollouts of policies trained with our annotations on representative downstream [PITH_FULL_IMAGE:figures/full_fig_p071_15.png] view at source ↗

**Figure 16.** Figure 16: Downstream tasks supported by the proposed annotations. Object, keypoint, and affordance [PITH_FULL_IMAGE:figures/full_fig_p073_16.png] view at source ↗

read the original abstract

Simulation enables scalable robot data collection, but raw 3D assets provide only geometry, lacking the semantic, interactive, and physical knowledge needed to specify where and how robots should act. In this work, we present AnnotateAnything, a general automatic annotation framework that converts passive 3D assets into manipulation-ready assets with structured, diverse, and executable manipulation labels. AnnotateAnything is built around two complementary pipelines. First, a unified visual-language annotation pipeline using vision-language reasoning to infer object semantics, interaction constraints, and 3D-grounded cues, providing human-prior guidance for identifying meaningful interaction regions. Second, a fully automatic and massively parallel physics annotation pipeline grounds these priors in each asset's geometry and physical constraints through candidate generation, geometry optimization and trajectory generation. This pipeline produces diverse and executable action annotations, including grasp poses, dexterous contacts, articulation waypoints, insertion directions, hanging affordances, and navigation targets. Using the generated annotations, we further build an asynchronous parallel simulation data-collection system across diverse objects, tasks, and robot embodiments. Experiments demonstrate that AnnotateAnything achieves superior annotation efficiency, data-collection efficiency, and task success rates over existing annotation and data-generation pipelines, while also supporting downstream tasks such as affordance detection, robotic VQA, and visual instruction finetuning. We provide project materials on the project page and plan to release the full code, annotations, and benchmark to facilitate future research. Videos, code, demo assets, and annotations are provided in supplementary materials Project page: https://tourmaline-caramel-169490.netlify.app.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

AnnotateAnything pairs a VLM stage for semantic priors with a parallel physics pipeline to label 3D assets, but the VLM reliability is the untested load-bearing piece.

read the letter

The punchline is that AnnotateAnything combines vision-language models with a parallel physics pipeline to automatically generate structured manipulation labels for 3D assets, addressing data scarcity in robot learning. This integration allows for diverse executable annotations like grasps and articulation waypoints that can then feed into simulation-based data collection.

The paper does a good job laying out the two complementary pipelines and showing how they produce labels across different tasks and robot embodiments. The asynchronous parallel simulation system built on these annotations is a practical step toward scalable training. Supporting downstream applications such as affordance detection and visual instruction finetuning adds to its utility. Releasing the code, annotations, and benchmark as planned will make this work more impactful for the community.

The soft spots center on the evaluation of the vision-language reasoning pipeline. The method assumes that the VLM can reliably infer object semantics, interaction constraints, and 3D-grounded cues without systematic errors that the physics stage cannot fix. Yet the abstract provides no quantitative measures of VLM accuracy, no error analysis, and no ablations comparing the full system to a physics-only version. The claims of superior annotation efficiency, data-collection efficiency, and task success rates over existing pipelines are stated but not backed by specific numbers or detailed baselines in the summary. If the VLM priors are frequently off, the generated trajectories could be valid in simulation but not useful for real manipulation.

This paper is for roboticists and machine learning researchers focused on simulation-to-real transfer and large-scale data generation for manipulation tasks. A reader building datasets or training policies on diverse objects would find the framework and released materials directly applicable.

It deserves a serious referee because the problem it targets is central to scaling robot learning, and the proposed solution is a clear engineering contribution worth detailed review and potential adoption after addressing the validation gaps.

Referee Report

3 major / 2 minor

Summary. The paper presents AnnotateAnything, an automatic annotation framework for 3D assets that combines a vision-language reasoning pipeline (to infer semantics, interaction constraints, and 3D-grounded cues) with a parallel physics pipeline (for candidate generation, geometry optimization, and trajectory generation). This produces structured manipulation labels such as grasp poses, dexterous contacts, articulation waypoints, and affordances. The annotations enable an asynchronous parallel simulation data-collection system, with claims of superior annotation efficiency, data-collection efficiency, and downstream task success rates compared to existing pipelines, plus support for affordance detection, robotic VQA, and visual instruction finetuning. The authors plan to release code, annotations, and benchmarks.

Significance. If the performance claims hold with rigorous evidence, the work could substantially reduce the manual effort required to prepare 3D assets for robot manipulation learning, enabling scalable data generation across diverse objects and embodiments. The explicit plan to release full code, annotations, and a benchmark is a clear strength that would support reproducibility and follow-on research.

major comments (3)

[Abstract] Abstract: the central claim of superior annotation efficiency, data-collection efficiency, and task success rates is asserted without any quantitative metrics, error bars, dataset sizes, comparison baselines, or statistical tests, preventing evaluation of whether the VLM-plus-physics pipeline actually outperforms existing methods.
[Method (VLM pipeline)] VLM pipeline description: the load-bearing assumption that vision-language reasoning produces accurate human-prior guidance on interaction regions and constraints (without systematic semantic errors) is untested; no VLM error rates, inter-annotator agreement scores, or ablation isolating VLM contribution versus physics grounding are reported, so it is unclear whether downstream trajectories remain semantically valid when VLM outputs contain mistakes.
[Experiments] Experiments section: the reported task success rates and efficiency gains lack details on the number of objects, robot embodiments, task categories, or statistical significance, making it impossible to assess whether the asynchronous parallel data-collection system delivers the claimed improvements over prior annotation pipelines.

minor comments (2)

[Abstract] The abstract and method sections use terms such as 'human-prior guidance' and 'executable action annotations' without a precise definition or example of the output label format.
[Conclusion] Project page and supplementary materials are referenced but no explicit list of released assets (e.g., number of annotated objects, annotation schema) is provided in the main text.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and commit to revisions that add the requested quantitative details, analyses, and clarifications without altering the core claims or methodology.

read point-by-point responses

Referee: [Abstract] Abstract: the central claim of superior annotation efficiency, data-collection efficiency, and task success rates is asserted without any quantitative metrics, error bars, dataset sizes, comparison baselines, or statistical tests, preventing evaluation of whether the VLM-plus-physics pipeline actually outperforms existing methods.

Authors: We agree that the abstract would be strengthened by including concrete metrics. In the revised version we will add specific quantitative results drawn from the experiments section (e.g., annotation-time reductions, data-collection throughput, and task success rates with error bars, dataset sizes, and baseline comparisons) while preserving the high-level summary. revision: yes
Referee: [Method (VLM pipeline)] VLM pipeline description: the load-bearing assumption that vision-language reasoning produces accurate human-prior guidance on interaction regions and constraints (without systematic semantic errors) is untested; no VLM error rates, inter-annotator agreement scores, or ablation isolating VLM contribution versus physics grounding are reported, so it is unclear whether downstream trajectories remain semantically valid when VLM outputs contain mistakes.

Authors: The physics pipeline is designed to ground and correct VLM outputs through geometry optimization and constraint checking. We acknowledge that an explicit error analysis and ablation would improve transparency. In revision we will add VLM output accuracy metrics on a held-out subset and an ablation comparing full pipeline performance with and without VLM priors. revision: yes
Referee: [Experiments] Experiments section: the reported task success rates and efficiency gains lack details on the number of objects, robot embodiments, task categories, or statistical significance, making it impossible to assess whether the asynchronous parallel data-collection system delivers the claimed improvements over prior annotation pipelines.

Authors: We will expand the experiments section to report the exact numbers of objects, embodiments, and task categories used, together with statistical significance tests (e.g., paired t-tests or Wilcoxon tests) and confidence intervals for all efficiency and success-rate comparisons. revision: yes

Circularity Check

0 steps flagged

No circularity: pipeline claims rest on independent experimental evaluation

full rationale

The paper presents AnnotateAnything as a two-stage annotation system (VLM reasoning followed by physics grounding) that generates labels for downstream robot tasks. No mathematical derivations, fitted parameters renamed as predictions, or self-referential definitions appear in the provided text. Performance claims are supported by comparisons to existing pipelines on efficiency and task success metrics, which are external to the method's internal definitions. No load-bearing self-citations or uniqueness theorems are invoked. The central assumption about VLM reliability is an empirical claim open to falsification rather than a definitional reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no explicit free parameters, axioms, or invented entities; the method relies on pre-trained vision-language models and physics engines assumed to be available from prior literature.

pith-pipeline@v0.9.1-grok · 5854 in / 1207 out tokens · 29887 ms · 2026-06-27T01:13:11.502727+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

109 extracted references · 8 canonical work pages · 2 internal anchors

[1]

Pi-0.7: A steerable generalist robotic foundation model with emergent capabilities

Physical Intelligence. Pi-0.7: A steerable generalist robotic foundation model with emergent capabilities. arXiv preprint, 2026. CorpusID: 287607456

2026
[2]

Gr00t n1: An open foundation model for generalist humanoid robots

NVIDIA. Gr00t n1: An open foundation model for generalist humanoid robots. arXiv:2503.14734, 2025

Pith/arXiv arXiv 2025
[3]

Gen-0: Embodied foundation models that scale with physical interaction

Generalist AI Team. Gen-0: Embodied foundation models that scale with physical interaction. Generalist AI Blog, 2025. November 4, 2025

2025
[4]

World action models are zero-shot policies

Seonghyeon Ye, Yunhao Ge, Kaiyuan Zheng, and Joel Jang. World action models are zero-shot policies. arXiv:2602.15922, 2026

Pith/arXiv arXiv 2026
[5]

Egoscale: Scaling dexterous manipulation with diverse egocentric human data

Ruijie Zheng, Dantong Niu, Yuqi Xie, and Linxi Jim Fan. Egoscale: Scaling dexterous manipulation with diverse egocentric human data. arXiv:2602.16710, 2026

arXiv 2026
[6]

Physical Intelligence, Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Manuel Y . Galliker, Dibya Ghosh, Lachy Groom, Karol Hausman, Brian Ichter, Szymon Jakubczak, Tim Jones, Liyiming Ke, Devin LeBlanc, Sergey Levine, Adrian Li-Bell, Mohith Mothukuri, Suraj Nair, Karl Pertsc...

Pith/arXiv arXiv 2025
[7]

Learning part-aware dense 3d feature field for generalizable articulated object manipulation, 2026

Yue Chen, Muqing Jiang, Kaifeng Zheng, Jiaqi Liang, Chenrui Tie, Haoran Lu, Ruihai Wu, and Hao Dong. Learning part-aware dense 3d feature field for generalizable articulated object manipulation, 2026. URLhttps://arxiv.org/abs/2602.14193

arXiv 2026
[8]

Shirwatkar, N

Ruihai Wu, Haozhe Chen, Mingtong Zhang, Haoran Lu, Yitong Li, and Yunzhu Li. Neural dynamics augmented diffusion policy. In2025 IEEE International Conference on Robotics and Automation (ICRA), pages 13234–13241, 2025. doi: 10.1109/ICRA55743.2025.11128651

work page doi:10.1109/icra55743.2025.11128651 2025
[9]

Biassemble: Learning collaborative affordance for bimanual geometric assembly.ArXiv, abs/2506.06221, 2025

Yan Shen, Ruihai Wu, Yubin Ke, Xinyuan Song, Zeyi Li, Xiaoqi Li, Hongwei Fan, Haoran Lu, and Hao Dong. Biassemble: Learning collaborative affordance for bimanual geometric assembly.ArXiv, abs/2506.06221, 2025. URL https://api.semanticscholar.org/ CorpusID:279244917

arXiv 2025
[10]

Broadcasting support relations recursively from local dynamics for object retrieval in clutters

Yitong Li, Ruihai Wu, Haoran Lu, Chuanruo Ning, Yan Shen, Guanqi Zhan, and Hao Dong. Broadcasting support relations recursively from local dynamics for object retrieval in clutters. ArXiv, abs/2406.02283, 2024. URL https://api.semanticscholar.org/CorpusID: 270226492

arXiv 2024
[11]

Garmentpile++: Affordance-driven cluttered garments retrieval with vision-language reasoning

Mingleyang Li, Yuran Wang, Yue Chen, Tianxing Chen, Jiaqi Liang, Zishun Shen, Haoran Lu, Ruihai Wu, and Hao Dong. Garmentpile++: Affordance-driven cluttered garments retrieval with vision-language reasoning. Preprint, 2026. CorpusID: 286238271

2026
[12]

Unigarmentmanip: A unified framework for category-level garment manipulation via dense visual correspondence

Ruihai Wu, Haoran Lu, Yiyan Wang, Yubo Wang, and Hao Dong. Unigarmentmanip: A unified framework for category-level garment manipulation via dense visual correspondence. 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 16340–16350, 2024. URL https://api.semanticscholar.org/CorpusID:269757227

2024
[13]

Learning to feel the future: Dreamtacvla for contact-rich manipulation.ArXiv, abs/2512.23864, 2025

Guo Ye, Zexi Zhang, Xu Zhao, Shang Wu, Haoran Lu, Shihan Lu, and Han Liu. Learning to feel the future: Dreamtacvla for contact-rich manipulation.ArXiv, abs/2512.23864, 2025. URLhttps://api.semanticscholar.org/CorpusID:284350273. 11

Pith/arXiv arXiv 2025
[14]

Roboverse: Towards a unified platform, dataset and benchmark for scalable and generalizable robot learn- ing.ArXiv, abs/2504.18904, 2025

Haoran Geng, Feishi Wang, Songlin Wei, Yuyang Li, Bangjun Wang, Boshi An, Char- lie Tianyue Cheng, Haozhe Lou, Peihao Li, Yen-Jen Wang, Yutong Liang, Dylan Goetting, Chaoyi Xu, Haozhe Chen, Yuxi Qian, Yiran Geng, Jiageng Mao, Weikang Wan, Mingtong Zhang, Jiangran Lyu, Siheng Zhao, Jiazhao Zhang, Jialiang Zhang, Chengyang Zhao, Haoran Lu, Yufei Ding, Ran G...

arXiv 2025
[15]

Unigarment: A unified simulation and benchmark for garment manipulation

Haoran Lu, Yitong Li, Ruihai Wu, Chuanruo Ning, Yan Shen, and Hao Dong. Unigarment: A unified simulation and benchmark for garment manipulation. Preprint, 2024. CorpusID: 275782214

2024
[16]

Garmentlab: A unified simulation and benchmark for garment manipulation.ArXiv, abs/2411.01200, 2024

Haoran Lu, Ruihai Wu, Yitong Li, Sijie Li, Ziyu Zhu, Chuanruo Ning, Yan Shen, Longzan Luo, Yuanpei Chen, and Hao Dong. Garmentlab: A unified simulation and benchmark for garment manipulation.ArXiv, abs/2411.01200, 2024. URL https://api.semanticscholar.org/ CorpusID:273811186

arXiv 2024
[17]

Lightwheel kitchen: 3d kitchen asset collection for nvidia isaac sim

Lightwheel. Lightwheel kitchen: 3d kitchen asset collection for nvidia isaac sim. GitHub repos- itory, 2025. URL https://github.com/LightwheelAI/Lightwheel_Kitchen. Open- source 3D assets under CC BY-NC 4.0

2025
[18]

Isaac lab: A gpu-accelerated simulation framework for multi-modal robot learning.arXiv preprint arXiv:2511.04831, 2025

Mayank Mittal, Pascal Roth, James Tigue, Antoine Richard, Octi Zhang, Peter Du, Antonio Serrano-Munoz, Xinjie Yao, René Zurbrügg, Nikita Rudin, et al. Isaac lab: A gpu-accelerated simulation framework for multi-modal robot learning.arXiv preprint arXiv:2511.04831, 2025

Pith/arXiv arXiv 2025
[19]

Lw-benchhub: Lightwheel’s end-to-end embodied ai simulation platform

Lightwheel Team. Lw-benchhub: Lightwheel’s end-to-end embodied ai simulation platform. GitHub repository, 2025. URLhttps://github.com/lightwheel-ai/lw_benchhub

2025
[20]

Matthews, Ivan Villa-Renteria, Jerry Huayang Tang, Claire Tang, Fei Xia, Yunzhu Li, Silvio Savarese, Hyowon Gweon, C

Chengshu Li, Ruohan Zhang, Josiah Wong, Cem Gokmen, Sanjana Srivastava, Roberto Martín- Martín, Chen Wang, Gabrael Levine, Wensi Ai, Benjamin Jose Martinez, Hang Yin, Michael Lingelbach, Minjune Hwang, Ayano Hiranaka, Sujay Garlanka, Arman Aydin, Sharon Lee, Jiankai Sun, Mona Anvari, Manasi Sharma, Dhruva Bansal, Samuel Hunter, Kyu-Young Kim, Alan Lou, Ca...

Pith/arXiv arXiv 2024
[21]

Mind to hand: Purposeful robotic control via embodied reasoning, 2025

Peijun Tang, Shangjin Xie, Binyan Sun, Baifu Huang, Kuncheng Luo, Haotian Yang, Weiqi Jin, and Jianan Wang. Mind to hand: Purposeful robotic control via embodied reasoning, 2025. URLhttps://arxiv.org/abs/2512.08580

arXiv 2025
[22]

Spatialvla: Exploring spatial representations for visual-language-action model.ArXiv, abs/2501.15830, 2025

Delin Qu, Haoming Song, Qizhi Chen, Yuanqi Yao, Xinyi Ye, Yani Ding, Zhigang Wang, Jiayuan Gu, Bin Zhao, Dong Wang, and Xuelong Li. Spatialvla: Exploring spatial representations for visual-language-action model.ArXiv, abs/2501.15830, 2025. URL https://api.semanticscholar.org/CorpusID:275921131

Pith/arXiv arXiv 2025
[23]

Scalable trajectory generation for whole-body mobile manipulation

Yida Niu, Xinhai Chang, Xin Liu, Ziyuan Jiao, and Yixin Zhu. Scalable trajectory generation for whole-body mobile manipulation. InCVPR, 2026

2026
[24]

Molmob0t: Large-scale simulation enables zero-shot manipulation

Abhay Deshpande, Maya Guru, Rose Hendrix, Snehal Jauhri, Haoquan Fang, Wilbert Pumacay, Yejin Kim, Quinn Pfeifer, Ying-Chun Lee, Piper Wolters, Omar Rayyan, Mingtong Zhang, Karen Farley, Winson Han, Eli Vanderbilt, Dieter Fox, Ali Farhadi, Georgia Chalvatzaki, Dhruv Shah, and Ranjay Krishna. Molmob0t: Large-scale simulation enables zero-shot manipulation....

2026
[25]

Manitwin: Scaling data-generation-ready digital object dataset to 100k

Kaixuan Wang, Tianxing Chen, Jiawei Liu, Honghao Su, Shaolong Zhu, Minxuan Wang, Zixuan Li, Yue Chen, Huan-ang Gao, Yusen Qin, Jiawei Wang, Qixuan Zhang, Lan Xu, Jingyi Yu, Yao Mu, and Ping Luo. Manitwin: Scaling data-generation-ready digital object dataset to 100k. Preprint, 2026. CorpusID: 286583937. 12

2026
[26]

Univtac: A unified simulation platform for visuo-tactile manipulation data generation, learning, and benchmarking.arXiv preprint arXiv:2602.10093, 2026

Baijun Chen, Weijie Wan, Tianxing Chen, Xianda Guo, Congsheng Xu, Yuanyang Qi, Haojie Zhang, Longyan Wu, Tianling Xu, Zixuan Li, et al. Univtac: A unified simulation platform for visuo-tactile manipulation data generation, learning, and benchmarking.arXiv preprint arXiv:2602.10093, 2026

arXiv 2026
[27]

Robotwin 2.0: A scalable data generator and benchmark with strong domain randomization for robust bimanual robotic manipulation.ArXiv, abs/2506.18088, 2025

Tianxing Chen, Zanxin Chen, Baijun Chen, Zijian Cai, Yibin Liu, Qiwei Liang, Zixuan Li, XianLian Lin, Yiheng Ge, Zhenyu Gu, Weiliang Deng, Yubin Guo, Tian Nian, Xuanbing Xie, Qiangyu Chen, Kailun Su, Tianling Xu, Guodong Liu, Mengkang Hu, Huan ang Gao, Kaixuan Wang, Zhixuan Liang, Yusen Qin, Xiaokang Yang, Ping Luo, and Yao Mu. Robotwin 2.0: A scalable da...

Pith/arXiv arXiv 2025
[28]

Rmbench: Memory-dependent robotic manipulation benchmark with insights into policy design.ArXiv, abs/2603.01229, 2026

Tianxing Chen, Yuran Wang, Mingleyang Li, Yanjiao Qin, Haochen Shi, Zixuan Li, Yifan Hu, Yingsheng J Zhang, Kaixuan Wang, Yue Chen, Hongchen Wang, Renjing Xu, Ruihai Wu, Yao Mu, Yaodong Yang, Hao Dong, and Ping Luo. Rmbench: Memory-dependent robotic manipulation benchmark with insights into policy design.ArXiv, abs/2603.01229, 2026. URL https://api.semant...

arXiv 2026
[29]

Interndata-a1: Pioneering high-fidelity synthetic data for pre-training generalist policy.ArXiv, abs/2511.16651, 2025

Yang Tian, Yuyin Yang, Yiman Xie, Zetao Cai, Xu Shi, Ning Gao, Hangxu Liu, Xuekun Jiang, Zherui Qiu, Feng Yuan, Yaping Li, Ping Wang, Junhao Cai, Jia Zeng, Hao Dong, and Jiangmiao Pang. Interndata-a1: Pioneering high-fidelity synthetic data for pre-training generalist policy.ArXiv, abs/2511.16651, 2025. URL https://api.semanticscholar. org/CorpusID:283110021

arXiv 2025
[30]

Robogen: Towards unleashing infinite data for automated robot learning via generative simulation, 2023

Yufei Wang, Zhou Xian, Feng Chen, Tsun-Hsuan Wang, Yian Wang, Katerina Fragkiadaki, Zackory Erickson, David Held, and Chuang Gan. Robogen: Towards unleashing infinite data for automated robot learning via generative simulation, 2023

2023
[31]

Mobilemanibench: Simplifying model verification for mobile manipulation

Wenbo Wang, Fangyun Wei, QiXiu Li, Xi Chen, Yaobo Liang, Chang Xu, Jiaolong Yang, and Baining Guo. Mobilemanibench: Simplifying model verification for mobile manipulation. arXiv preprint arXiv:2602.05233, 2026

arXiv 2026
[32]

Homie: Humanoid loco-manipulation with isomorphic exoskeleton cockpit.arXiv preprint arXiv:2502.13013, 2025

Qingwei Ben, Feiyu Jia, Jia Zeng, Junting Dong, Dahua Lin, and Jiangmiao Pang. Homie: Humanoid loco-manipulation with isomorphic exoskeleton cockpit.arXiv preprint arXiv:2502.13013, 2025

arXiv 2025
[33]

Sonic: Supersizing motion tracking for natural humanoid whole-body control.arXiv preprint arXiv:2511.07820, 2025

Zhengyi Luo, Ye Yuan, Tingwu Wang, Chenran Li, Sirui Chen, Fernando Castañeda, Zi-Ang Cao, Jiefeng Li, David Minor, Qingwei Ben, Xingye Da, Runyu Ding, Cyrus Hogg, Lina Song, Edy Lim, Eugene Jeong, Tairan He, Haoru Xue, Wenli Xiao, Zi Wang, Simon Yuen, Jan Kautz, Yan Chang, Umar Iqbal, Linxi Fan, and Yuke Zhu. Sonic: Supersizing motion tracking for natura...

Pith/arXiv arXiv 2025
[34]

curobov2: Dynamics- aware motion generation with depth-fused distance fields for high-dof robots, 2026

Balakumar Sundaralingam, Adithyavairavan Murali, and Stan Birchfield. curobov2: Dynamics- aware motion generation with depth-fused distance fields for high-dof robots, 2026

2026
[35]

Key-grid: Unsupervised 3d keypoints detection using grid heatmap features, 2024

Chengkai Hou, Zhengrong Xue, Bingyang Zhou, Jinghan Ke, Lin Shao, and Huazhe Xu. Key-grid: Unsupervised 3d keypoints detection using grid heatmap features, 2024. URL https://arxiv.org/abs/2410.02237

arXiv 2024
[36]

Skeleton merger: an unsupervised aligned keypoint detector

Ruoxi Shi, Zhengrong Xue, Yang You, and Cewu Lu. Skeleton merger: an unsupervised aligned keypoint detector. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 43–52, 2021

2021
[37]

Where2explore: Few-shot affordance learning for unseen novel categories of articulated objects

Chuanruo Ning, Ruihai Wu, Haoran Lu, Kaichun Mo, and Hao Dong. Where2explore: Few-shot affordance learning for unseen novel categories of articulated objects. ArXiv, abs/2309.07473, 2023. URL https://api.semanticscholar.org/CorpusID: 261823492

arXiv 2023
[38]

Genmanip: Llm-driven simulation for generalizable instruction-following manipulation

Ning Gao, Yilun Chen, Shuai Yang, Xinyi Chen, Yang Tian, Hao Li, Haifeng Huang, Hanqing Wang, Tai Wang, and Jiangmiao Pang. Genmanip: Llm-driven simulation for generalizable instruction-following manipulation. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 12187–12198, 2025. 13

2025
[39]

Genie sim 3.0 : A high-fidelity comprehensive simulation platform for humanoid robot, 2026

Chenghao Yin, Da Huang, Di Yang, Jichao Wang, Nanshu Zhao, Chen Xu, Wenjun Sun, Linjie Hou, Zhijun Li, Junhui Wu, Zhaobo Liu, Zhen Xiao, Sheng Zhang, Lei Bao, Rui Feng, Zhenquan Pang, Jiayu Li, Qian Wang, and Maoqing Yao. Genie sim 3.0 : A high-fidelity comprehensive simulation platform for humanoid robot, 2026. URL https://arxiv.org/ abs/2601.02078

Pith/arXiv arXiv 2026
[40]

Robotwin: Dual-arm robot benchmark with generative digital twins

Yao Mu, Tianxing Chen, Zanxin Chen, Shijia Peng, Zhiqian Lan, Zeyu Gao, Zhixuan Liang, Qiaojun Yu, Yude Zou, Mingkun Xu, Lunkai Lin, Zhiqiang Xie, Mingyu Ding, and Ping Luo. Robotwin: Dual-arm robot benchmark with generative digital twins. InProceedings of the Computer Vision and Pattern Recognition Conference (CVPR), pages 27649–27660, June 2025

2025
[41]

Ultradexgrasp: Learning universal dexterous grasping for bimanual robots with synthetic data,

Sizhe Yang, Yiman Xie, Zhixuan Liang, Yang Tian, Jia Zeng, Dahua Lin, and Jiangmiao Pang. Ultradexgrasp: Learning universal dexterous grasping for bimanual robots with synthetic data,
[42]

URLhttps://arxiv.org/abs/2603.05312

arXiv
[43]

Unidexgrasp++: Improving dexterous grasping policy learning via geometry-aware curriculum and iterative generalist-specialist learning

Weikang Wan, Haoran Geng, Yun Liu, Zikang Shan, Yaodong Yang, Li Yi, and He Wang. Unidexgrasp++: Improving dexterous grasping policy learning via geometry-aware curriculum and iterative generalist-specialist learning. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 3891–3902, 2023

2023
[44]

Dexgraspnet: A large-scale robotic dexterous grasp dataset for general objects based on simulation

Ruicheng Wang, Jialiang Zhang, Jiayi Chen, Yinzhen Xu, Puhao Li, Tengyu Liu, and He Wang. Dexgraspnet: A large-scale robotic dexterous grasp dataset for general objects based on simulation. In2023 IEEE International Conference on Robotics and Automation (ICRA), pages 11359–11366. IEEE, 2023. URLhttps://arxiv.org/abs/2210.02697

arXiv 2023
[45]

Dexart: Benchmarking generalizable dexterous manipulation with articulated objects

Chen Bao, Helin Xu, Yuzhe Qin, and Xiaolong Wang. Dexart: Benchmarking generalizable dexterous manipulation with articulated objects. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21190–21200, 2023

2023
[46]

Bodex: Scalable and efficient robotic dexterous grasp synthesis using bilevel optimization, 2025

Jiayi Chen, Yubin Ke, and He Wang. Bodex: Scalable and efficient robotic dexterous grasp synthesis using bilevel optimization, 2025. URLhttps://arxiv.org/abs/2412.16490

arXiv 2025
[47]

Bidexgrasp: Coordinated bimanual dexterous grasps across object geometries and sizes, 2026

Mu Lin, Yi-Lin Wei, Jiaxuan Chen, Yuhao Lin, Shuoyu Chen, Jiangran Lyu, Jiayi Chen, Yansong Tang, He Wang, and Wei-Shi Zheng. Bidexgrasp: Coordinated bimanual dexterous grasps across object geometries and sizes, 2026. URL https://arxiv.org/abs/2604. 06589

2026
[48]

Articubot: Learning universal articulated object ma- nipulation policy via large scale simulation.ArXiv, abs/2503.03045, 2025

Yufei Wang, Ziyu Wang, Mino Nakura, Pratik Bhowal, Chia-Liang Kuo, Yi-Ting Chen, Zackory Erickson, and David Held. Articubot: Learning universal articulated object ma- nipulation policy via large scale simulation.ArXiv, abs/2503.03045, 2025. URL https: //api.semanticscholar.org/CorpusID:276781877

arXiv 2025
[49]

Qwen3.5-omni technical report, 2026

Qwen Team. Qwen3.5-omni technical report, 2026. URL https://arxiv.org/abs/2604. 15804

2026
[50]

P3-sam: Native 3d part segmentation.arXiv preprint arXiv:2509.06784, 2025

Changfeng Ma, Yang Li, Xinhao Yan, Jiachen Xu, Yunhan Yang, Chunshi Wang, Zibo Zhao, Yanwen Guo, Zhuo Chen, and Chunchao Guo. P3-sam: Native 3d part segmentation.arXiv preprint arXiv:2509.06784, 2025

arXiv 2025
[51]

X-part: High fidelity and structure coherent shape decomposition.arXiv preprint arXiv:2509.08643, 2025

Xinhao Yan, Jiachen Xu, Yang Li, Changfeng Ma, Yunhan Yang, Chunshi Wang, Zibo Zhao, Zeqiang Lai, Yunfei Zhao, Zhuo Chen, et al. X-part: High fidelity and structure coherent shape decomposition.arXiv preprint arXiv:2509.08643, 2025

arXiv 2025
[52]

curobo: Parallelized collision-free minimum-jerk robot motion generation, 2023

Balakumar Sundaralingam, Siva Kumar Sastry Hari, Adam Fishman, Caelan Garrett, Karl Van Wyk, Valts Blukis, Alexander Millane, Helen Oleynikova, Ankur Handa, Fabio Ramos, Nathan Ratliff, and Dieter Fox. curobo: Parallelized collision-free minimum-jerk robot motion generation, 2023. URLhttps://arxiv.org/abs/2310.17274

arXiv 2023
[53]

PA3FF: Learning part-aware dense 3d feature field for generalizable articulated object manipulation

Yue Chen, Muqing Jiang, Kaifeng Zheng, Jiaqi Liang, Chenrui Tie, Haoran Lu, Ruihai Wu, and Hao Dong. PA3FF: Learning part-aware dense 3d feature field for generalizable articulated object manipulation. InThe Fourteenth International Conference on Learning Representations, 2026. doi: 10.48550/arXiv.2602.14193. URL https://openreview.net/ forum?id=qXfRXfAHO...

work page doi:10.48550/arxiv.2602.14193 2026
[54]

acl-main.485/

Ruihai Wu, Haoran Lu, Yiyan Wang, Yubo Wang, and Hao Dong. UniGarmentManip: A unified framework for category-level garment manipulation via dense visual correspondence. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 16340–16350, 2024. doi: 10.1109/CVPR52733.2024.01546. URL https: //openaccess.thecvf.com/c...

work page doi:10.1109/cvpr52733.2024.01546 2024
[55]

Where2Explore: Few- shot affordance learning for unseen novel categories of articulated objects

Chuanruo Ning, Ruihai Wu, Haoran Lu, Kaichun Mo, and Hao Dong. Where2Explore: Few- shot affordance learning for unseen novel categories of articulated objects. InAdvances in Neural Information Processing Systems, 2023. URL https://openreview.net/forum? id=QLllDwizVd

2023
[56]

ImageManip: Image-based robotic manipulation with affordance-guided next view selection.arXiv preprint arXiv:2310.09069, 2023

Xiaoqi Li, Yanzi Wang, Yan Shen, Iaroslav Ponomarenko, Haoran Lu, Qianxu Wang, Boshi An, Jiaming Liu, and Hao Dong. ImageManip: Image-based robotic manipulation with affordance-guided next view selection.arXiv preprint arXiv:2310.09069, 2023. doi: 10.48550/ arXiv.2310.09069. URLhttps://arxiv.org/abs/2310.09069

arXiv 2023
[57]

SPACENUM: Revisiting Spatial Numerical Understanding in VLMs

Jianshu Zhang, Yijiang Li, Huifeixin Chen, Haoran Lu, Letian Xue, Bingyang Wang, and Han Liu. SPACENUM: Revisiting spatial numerical understanding in VLMs.arXiv preprint arXiv:2605.23898, 2026. doi: 10.48550/arXiv.2605.23898. URL https://arxiv.org/abs/ 2605.23898

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2605.23898 2026
[58]

PROGRESSLM: Towards Progress Reasoning in Vision-Language Models

Jianshu Zhang, Chengxuan Qian, Haosen Sun, Haoran Lu, Dingcheng Wang, Letian Xue, and Han Liu. ProgressLM: Towards progress reasoning in vision-language models.arXiv preprint arXiv:2601.15224, 2026. doi: 10.48550/arXiv.2601.15224. URL https://arxiv. org/abs/2601.15224. ACL 2026 Main Conference Oral; ICLR 2026 Workshop on World Models

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2601.15224 2026
[59]

arXiv:2510.01586 [cs.AI] doi:10.48550/arXiv.2510.01586

Zhenyu Pan, Yiting Zhang, Zhuo Liu, Yolo Yunlong Tang, Zeliang Zhang, Haozheng Luo, Yuwei Han, Jianshu Zhang, Dennis Wu, Hong-Yu Chen, Haoran Lu, Haoyang Fang, Manling Li, Chenliang Xu, Philip S. Yu, and Han Liu. AdvEvo-MARL: Shaping in- ternalized safety through adversarial co-evolution in multi-agent reinforcement learning. arXiv preprint arXiv:2510.015...

work page doi:10.48550/arxiv.2510.01586 2025
[60]

Chang, Li Yi, Subarna Tripathi, Leonidas J

Kaichun Mo, Shilin Zhu, Angel X. Chang, Li Yi, Subarna Tripathi, Leonidas J. Guibas, and Hao Su. Partnet: A large-scale benchmark for fine-grained and hierarchical part-level 3d object understanding. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 909–918, 2019. URL https://openaccess.thecvf.com/content_ CVPR_201...

2019
[61]

Partnext: A next-generation dataset for fine-grained and hierarchical 3d part understanding.arXiv preprint arXiv:2510.20155, 2025

Penghao Wang, Yiyang He, Xin Lv, Yukai Zhou, Lan Xu, Jingyi Yu, and Jiayuan Gu. Partnext: A next-generation dataset for fine-grained and hierarchical 3d part understanding.arXiv preprint arXiv:2510.20155, 2025

arXiv 2025
[62]

Keypointnet: A large-scale 3d keypoint dataset aggregated from numerous human annotations.arXiv preprint arXiv:2002.12687, 2020

Yang You, Yujing Lou, Chengkun Li, Zhoujun Cheng, Liangwei Li, Lizhuang Ma, Cewu Lu, and Weiming Wang. Keypointnet: A large-scale 3d keypoint dataset aggregated from numerous human annotations.arXiv preprint arXiv:2002.12687, 2020

arXiv 2002
[63]

Laso: Language-guided affordance segmentation on 3d object

Yuandong Liu, Yan Wei, Yue Jiang, et al. Laso: Language-guided affordance segmentation on 3d object. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024

2024
[64]

Stephen James, Zicong Ma, David Rovick Arrojo, and Andrew J. Davison. Rlbench: The robot learning benchmark & learning environment.arXiv preprint arXiv:1909.12271, 2019. URLhttps://arxiv.org/abs/1909.12271

arXiv 1909
[65]

Mimicgen: A data generation system for scalable robot learning using human demonstrations

Ajay Mandlekar, Soroush Nasiriany, Bowen Wen, Iretiayo Akinola, Yashraj Narang, Linxi Fan, Yuke Zhu, and Dieter Fox. Mimicgen: A data generation system for scalable robot learning using human demonstrations. InProceedings of The 7th Conference on Robot Learning, volume 229 ofProceedings of Machine Learning Research, pages 1820–1864. PMLR, 2023. URLhttps:/...

2023
[66]

Robogen: Towards unleashing infinite data for automated robot learning via generative simulation

Yufei Wang, Zhou Xian, Feng Chen, Tsun-Hsuan Wang, Yian Wang, Katerina Fragkiadaki, Zackory Erickson, David Held, and Chuang Gan. Robogen: Towards unleashing infinite data for automated robot learning via generative simulation. InProceedings of the 41st International Conference on Machine Learning, volume 235 ofProceedings of Machine Learning Research, pa...

2024
[67]

Gensim: Generating robotic simulation tasks via large language models

Lirui Wang, Yiyang Ling, Zhecheng Yuan, Mohit Shridhar, Chen Bao, Yuzhe Qin, Bailin Wang, Huazhe Xu, and Xiaolong Wang. Gensim: Generating robotic simulation tasks via large language models. InThe Twelfth International Conference on Learning Representations,
[68]

URLhttps://openreview.net/forum?id=OI3RoHoWAN
[69]

Meta-world: A benchmark and evaluation for multi-task and meta reinforcement learning.arXiv preprint arXiv:1910.10897, 2019

Tianhe Yu, Deirdre Quillen, Zhanpeng He, Ryan Julian, Avnish Narayan, Hayden Shiv- ely, Adithya Bellathur, Karol Hausman, Chelsea Finn, and Sergey Levine. Meta-world: A benchmark and evaluation for multi-task and meta reinforcement learning.arXiv preprint arXiv:1910.10897, 2019. URLhttps://arxiv.org/abs/1910.10897

arXiv 1910
[70]

Isaac gym: High performance gpu-based physics simulation for robot learning

Viktor Makoviychuk, Lukasz Wawrzyniak, Yunrong Guo, Michelle Lu, Kier Storey, Miles Macklin, David Hoeller, Nikita Rudin, Arthur Allshire, Ankur Handa, and Gavriel State. Isaac gym: High performance gpu-based physics simulation for robot learning. InThirty-Fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2021. URLht...

Pith/arXiv arXiv 2021
[71]

Learning dexterous in-hand manipulation.arXiv preprint arXiv:1808.00177, 2018

OpenAI, Marcin Andrychowicz, Bowen Baker, Maciek Chociej, Rafal Józefowicz, Bob McGrew, Jakub Pachocki, Arthur Petron, Matthias Plappert, Glenn Powell, Alex Ray, Jonas Schneider, Szymon Sidor, Josh Tobin, Peter Welinder, Lilian Weng, and Wojciech Zaremba. Learning dexterous in-hand manipulation.arXiv preprint arXiv:1808.00177, 2018. URL https://arxiv.org/...

Pith/arXiv arXiv 2018
[72]

Dex-net 2.0: Deep learning to plan robust grasps with synthetic point clouds and analytic grasp metrics

Jeffrey Mahler, Jacky Liang, Sherdil Niyaz, Michael Laskey, Richard Doan, Xinyu Liu, Juan Aparicio Ojea, and Ken Goldberg. Dex-net 2.0: Deep learning to plan robust grasps with synthetic point clouds and analytic grasp metrics. InRobotics: Science and Systems, 2017. URLhttps://www.roboticsproceedings.org/rss13/p58.pdf

2017
[73]

Acronym: A large-scale grasp dataset based on simulation

Clemens Eppner, Arsalan Mousavian, and Dieter Fox. Acronym: A large-scale grasp dataset based on simulation. In2021 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2021. URL https://research.nvidia.com/publication/2021-05_ acronym-large-scale-grasp-dataset-based-simulation

2021
[74]

6-dof graspnet: Variational grasp generation for object manipulation

Arsalan Mousavian, Clemens Eppner, and Dieter Fox. 6-dof graspnet: Variational grasp generation for object manipulation. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 2901–2910, 2019. URL https://openaccess.thecvf. com/content_ICCV_2019/html/Mousavian_6-DOF_GraspNet_Variational_Grasp_ Generation_for_Object_Manipulation_I...

2019
[75]

Contact-graspnet: Efficient 6-dof grasp generation in cluttered scenes.arXiv preprint arXiv:2103.14127, 2021

Martin Sundermeyer, Arsalan Mousavian, Rudolph Triebel, and Dieter Fox. Contact-graspnet: Efficient 6-dof grasp generation in cluttered scenes.arXiv preprint arXiv:2103.14127, 2021. URLhttps://arxiv.org/abs/2103.14127

arXiv 2021
[76]

Bimanual grasp synthesis for dexterous robot hands.arXiv preprint arXiv:2411.15903, 2024

Yanming Shao and Chenxi Xiao. Bimanual grasp synthesis for dexterous robot hands.arXiv preprint arXiv:2411.15903, 2024. URLhttps://arxiv.org/abs/2411.15903

arXiv 2024
[78]

URLhttps://arxiv.org/abs/2101.02692

arXiv
[79]

Vat-mart: Learning visual action trajectory pro- posals for manipulating 3d articulated objects

Ruihai Wu, Yan Zhao, Kaichun Mo, Zizheng Guo, Yian Wang, Tianhao Wu, Qingnan Fan, Xuelin Chen, Leonidas Guibas, and Hao Dong. Vat-mart: Learning visual action trajectory pro- posals for manipulating 3d articulated objects. InThe Tenth International Conference on Learn- ing Representations, 2022. URLhttps://openreview.net/forum?id=iEx3PiooLy. 16

2022
[80]

Gapartnet: Cross-category domain-generalizable object perception and manipulation via generalizable and actionable parts

Haoran Geng, Helin Xu, Chengyang Zhao, Chao Xu, Li Yi, Siyuan Huang, and He Wang. Gapartnet: Cross-category domain-generalizable object perception and manipulation via generalizable and actionable parts. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7081– 7091, 2023. URL https://openaccess.thecvf.com/content/CV...

2023
[81]

O2o-afford: Annotation- free large-scale object-object affordance learning

Kaichun Mo, Yuzhe Qin, Fanbo Xiang, Hao Su, and Leonidas Guibas. O2o-afford: Annotation- free large-scale object-object affordance learning. InConference on Robot Learning, 2021. URLhttps://openreview.net/forum?id=EougVeukEH9

2021

Showing first 80 references.

[1] [1]

Pi-0.7: A steerable generalist robotic foundation model with emergent capabilities

Physical Intelligence. Pi-0.7: A steerable generalist robotic foundation model with emergent capabilities. arXiv preprint, 2026. CorpusID: 287607456

2026

[2] [2]

Gr00t n1: An open foundation model for generalist humanoid robots

NVIDIA. Gr00t n1: An open foundation model for generalist humanoid robots. arXiv:2503.14734, 2025

Pith/arXiv arXiv 2025

[3] [3]

Gen-0: Embodied foundation models that scale with physical interaction

Generalist AI Team. Gen-0: Embodied foundation models that scale with physical interaction. Generalist AI Blog, 2025. November 4, 2025

2025

[4] [4]

World action models are zero-shot policies

Seonghyeon Ye, Yunhao Ge, Kaiyuan Zheng, and Joel Jang. World action models are zero-shot policies. arXiv:2602.15922, 2026

Pith/arXiv arXiv 2026

[5] [5]

Egoscale: Scaling dexterous manipulation with diverse egocentric human data

Ruijie Zheng, Dantong Niu, Yuqi Xie, and Linxi Jim Fan. Egoscale: Scaling dexterous manipulation with diverse egocentric human data. arXiv:2602.16710, 2026

arXiv 2026

[6] [6]

Physical Intelligence, Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Manuel Y . Galliker, Dibya Ghosh, Lachy Groom, Karol Hausman, Brian Ichter, Szymon Jakubczak, Tim Jones, Liyiming Ke, Devin LeBlanc, Sergey Levine, Adrian Li-Bell, Mohith Mothukuri, Suraj Nair, Karl Pertsc...

Pith/arXiv arXiv 2025

[7] [7]

Learning part-aware dense 3d feature field for generalizable articulated object manipulation, 2026

Yue Chen, Muqing Jiang, Kaifeng Zheng, Jiaqi Liang, Chenrui Tie, Haoran Lu, Ruihai Wu, and Hao Dong. Learning part-aware dense 3d feature field for generalizable articulated object manipulation, 2026. URLhttps://arxiv.org/abs/2602.14193

arXiv 2026

[8] [8]

Shirwatkar, N

Ruihai Wu, Haozhe Chen, Mingtong Zhang, Haoran Lu, Yitong Li, and Yunzhu Li. Neural dynamics augmented diffusion policy. In2025 IEEE International Conference on Robotics and Automation (ICRA), pages 13234–13241, 2025. doi: 10.1109/ICRA55743.2025.11128651

work page doi:10.1109/icra55743.2025.11128651 2025

[9] [9]

Biassemble: Learning collaborative affordance for bimanual geometric assembly.ArXiv, abs/2506.06221, 2025

Yan Shen, Ruihai Wu, Yubin Ke, Xinyuan Song, Zeyi Li, Xiaoqi Li, Hongwei Fan, Haoran Lu, and Hao Dong. Biassemble: Learning collaborative affordance for bimanual geometric assembly.ArXiv, abs/2506.06221, 2025. URL https://api.semanticscholar.org/ CorpusID:279244917

arXiv 2025

[10] [10]

Broadcasting support relations recursively from local dynamics for object retrieval in clutters

Yitong Li, Ruihai Wu, Haoran Lu, Chuanruo Ning, Yan Shen, Guanqi Zhan, and Hao Dong. Broadcasting support relations recursively from local dynamics for object retrieval in clutters. ArXiv, abs/2406.02283, 2024. URL https://api.semanticscholar.org/CorpusID: 270226492

arXiv 2024

[11] [11]

Garmentpile++: Affordance-driven cluttered garments retrieval with vision-language reasoning

Mingleyang Li, Yuran Wang, Yue Chen, Tianxing Chen, Jiaqi Liang, Zishun Shen, Haoran Lu, Ruihai Wu, and Hao Dong. Garmentpile++: Affordance-driven cluttered garments retrieval with vision-language reasoning. Preprint, 2026. CorpusID: 286238271

2026

[12] [12]

Unigarmentmanip: A unified framework for category-level garment manipulation via dense visual correspondence

Ruihai Wu, Haoran Lu, Yiyan Wang, Yubo Wang, and Hao Dong. Unigarmentmanip: A unified framework for category-level garment manipulation via dense visual correspondence. 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 16340–16350, 2024. URL https://api.semanticscholar.org/CorpusID:269757227

2024

[13] [13]

Learning to feel the future: Dreamtacvla for contact-rich manipulation.ArXiv, abs/2512.23864, 2025

Guo Ye, Zexi Zhang, Xu Zhao, Shang Wu, Haoran Lu, Shihan Lu, and Han Liu. Learning to feel the future: Dreamtacvla for contact-rich manipulation.ArXiv, abs/2512.23864, 2025. URLhttps://api.semanticscholar.org/CorpusID:284350273. 11

Pith/arXiv arXiv 2025

[14] [14]

Roboverse: Towards a unified platform, dataset and benchmark for scalable and generalizable robot learn- ing.ArXiv, abs/2504.18904, 2025

Haoran Geng, Feishi Wang, Songlin Wei, Yuyang Li, Bangjun Wang, Boshi An, Char- lie Tianyue Cheng, Haozhe Lou, Peihao Li, Yen-Jen Wang, Yutong Liang, Dylan Goetting, Chaoyi Xu, Haozhe Chen, Yuxi Qian, Yiran Geng, Jiageng Mao, Weikang Wan, Mingtong Zhang, Jiangran Lyu, Siheng Zhao, Jiazhao Zhang, Jialiang Zhang, Chengyang Zhao, Haoran Lu, Yufei Ding, Ran G...

arXiv 2025

[15] [15]

Unigarment: A unified simulation and benchmark for garment manipulation

Haoran Lu, Yitong Li, Ruihai Wu, Chuanruo Ning, Yan Shen, and Hao Dong. Unigarment: A unified simulation and benchmark for garment manipulation. Preprint, 2024. CorpusID: 275782214

2024

[16] [16]

Garmentlab: A unified simulation and benchmark for garment manipulation.ArXiv, abs/2411.01200, 2024

Haoran Lu, Ruihai Wu, Yitong Li, Sijie Li, Ziyu Zhu, Chuanruo Ning, Yan Shen, Longzan Luo, Yuanpei Chen, and Hao Dong. Garmentlab: A unified simulation and benchmark for garment manipulation.ArXiv, abs/2411.01200, 2024. URL https://api.semanticscholar.org/ CorpusID:273811186

arXiv 2024

[17] [17]

Lightwheel kitchen: 3d kitchen asset collection for nvidia isaac sim

Lightwheel. Lightwheel kitchen: 3d kitchen asset collection for nvidia isaac sim. GitHub repos- itory, 2025. URL https://github.com/LightwheelAI/Lightwheel_Kitchen. Open- source 3D assets under CC BY-NC 4.0

2025

[18] [18]

Isaac lab: A gpu-accelerated simulation framework for multi-modal robot learning.arXiv preprint arXiv:2511.04831, 2025

Mayank Mittal, Pascal Roth, James Tigue, Antoine Richard, Octi Zhang, Peter Du, Antonio Serrano-Munoz, Xinjie Yao, René Zurbrügg, Nikita Rudin, et al. Isaac lab: A gpu-accelerated simulation framework for multi-modal robot learning.arXiv preprint arXiv:2511.04831, 2025

Pith/arXiv arXiv 2025

[19] [19]

Lw-benchhub: Lightwheel’s end-to-end embodied ai simulation platform

Lightwheel Team. Lw-benchhub: Lightwheel’s end-to-end embodied ai simulation platform. GitHub repository, 2025. URLhttps://github.com/lightwheel-ai/lw_benchhub

2025

[20] [20]

Matthews, Ivan Villa-Renteria, Jerry Huayang Tang, Claire Tang, Fei Xia, Yunzhu Li, Silvio Savarese, Hyowon Gweon, C

Chengshu Li, Ruohan Zhang, Josiah Wong, Cem Gokmen, Sanjana Srivastava, Roberto Martín- Martín, Chen Wang, Gabrael Levine, Wensi Ai, Benjamin Jose Martinez, Hang Yin, Michael Lingelbach, Minjune Hwang, Ayano Hiranaka, Sujay Garlanka, Arman Aydin, Sharon Lee, Jiankai Sun, Mona Anvari, Manasi Sharma, Dhruva Bansal, Samuel Hunter, Kyu-Young Kim, Alan Lou, Ca...

Pith/arXiv arXiv 2024

[21] [21]

Mind to hand: Purposeful robotic control via embodied reasoning, 2025

Peijun Tang, Shangjin Xie, Binyan Sun, Baifu Huang, Kuncheng Luo, Haotian Yang, Weiqi Jin, and Jianan Wang. Mind to hand: Purposeful robotic control via embodied reasoning, 2025. URLhttps://arxiv.org/abs/2512.08580

arXiv 2025

[22] [22]

Spatialvla: Exploring spatial representations for visual-language-action model.ArXiv, abs/2501.15830, 2025

Delin Qu, Haoming Song, Qizhi Chen, Yuanqi Yao, Xinyi Ye, Yani Ding, Zhigang Wang, Jiayuan Gu, Bin Zhao, Dong Wang, and Xuelong Li. Spatialvla: Exploring spatial representations for visual-language-action model.ArXiv, abs/2501.15830, 2025. URL https://api.semanticscholar.org/CorpusID:275921131

Pith/arXiv arXiv 2025

[23] [23]

Scalable trajectory generation for whole-body mobile manipulation

Yida Niu, Xinhai Chang, Xin Liu, Ziyuan Jiao, and Yixin Zhu. Scalable trajectory generation for whole-body mobile manipulation. InCVPR, 2026

2026

[24] [24]

Molmob0t: Large-scale simulation enables zero-shot manipulation

Abhay Deshpande, Maya Guru, Rose Hendrix, Snehal Jauhri, Haoquan Fang, Wilbert Pumacay, Yejin Kim, Quinn Pfeifer, Ying-Chun Lee, Piper Wolters, Omar Rayyan, Mingtong Zhang, Karen Farley, Winson Han, Eli Vanderbilt, Dieter Fox, Ali Farhadi, Georgia Chalvatzaki, Dhruv Shah, and Ranjay Krishna. Molmob0t: Large-scale simulation enables zero-shot manipulation....

2026

[25] [25]

Manitwin: Scaling data-generation-ready digital object dataset to 100k

Kaixuan Wang, Tianxing Chen, Jiawei Liu, Honghao Su, Shaolong Zhu, Minxuan Wang, Zixuan Li, Yue Chen, Huan-ang Gao, Yusen Qin, Jiawei Wang, Qixuan Zhang, Lan Xu, Jingyi Yu, Yao Mu, and Ping Luo. Manitwin: Scaling data-generation-ready digital object dataset to 100k. Preprint, 2026. CorpusID: 286583937. 12

2026

[26] [26]

Univtac: A unified simulation platform for visuo-tactile manipulation data generation, learning, and benchmarking.arXiv preprint arXiv:2602.10093, 2026

Baijun Chen, Weijie Wan, Tianxing Chen, Xianda Guo, Congsheng Xu, Yuanyang Qi, Haojie Zhang, Longyan Wu, Tianling Xu, Zixuan Li, et al. Univtac: A unified simulation platform for visuo-tactile manipulation data generation, learning, and benchmarking.arXiv preprint arXiv:2602.10093, 2026

arXiv 2026

[27] [27]

Robotwin 2.0: A scalable data generator and benchmark with strong domain randomization for robust bimanual robotic manipulation.ArXiv, abs/2506.18088, 2025

Tianxing Chen, Zanxin Chen, Baijun Chen, Zijian Cai, Yibin Liu, Qiwei Liang, Zixuan Li, XianLian Lin, Yiheng Ge, Zhenyu Gu, Weiliang Deng, Yubin Guo, Tian Nian, Xuanbing Xie, Qiangyu Chen, Kailun Su, Tianling Xu, Guodong Liu, Mengkang Hu, Huan ang Gao, Kaixuan Wang, Zhixuan Liang, Yusen Qin, Xiaokang Yang, Ping Luo, and Yao Mu. Robotwin 2.0: A scalable da...

Pith/arXiv arXiv 2025

[28] [28]

Rmbench: Memory-dependent robotic manipulation benchmark with insights into policy design.ArXiv, abs/2603.01229, 2026

Tianxing Chen, Yuran Wang, Mingleyang Li, Yanjiao Qin, Haochen Shi, Zixuan Li, Yifan Hu, Yingsheng J Zhang, Kaixuan Wang, Yue Chen, Hongchen Wang, Renjing Xu, Ruihai Wu, Yao Mu, Yaodong Yang, Hao Dong, and Ping Luo. Rmbench: Memory-dependent robotic manipulation benchmark with insights into policy design.ArXiv, abs/2603.01229, 2026. URL https://api.semant...

arXiv 2026

[29] [29]

Interndata-a1: Pioneering high-fidelity synthetic data for pre-training generalist policy.ArXiv, abs/2511.16651, 2025

Yang Tian, Yuyin Yang, Yiman Xie, Zetao Cai, Xu Shi, Ning Gao, Hangxu Liu, Xuekun Jiang, Zherui Qiu, Feng Yuan, Yaping Li, Ping Wang, Junhao Cai, Jia Zeng, Hao Dong, and Jiangmiao Pang. Interndata-a1: Pioneering high-fidelity synthetic data for pre-training generalist policy.ArXiv, abs/2511.16651, 2025. URL https://api.semanticscholar. org/CorpusID:283110021

arXiv 2025

[30] [30]

Robogen: Towards unleashing infinite data for automated robot learning via generative simulation, 2023

Yufei Wang, Zhou Xian, Feng Chen, Tsun-Hsuan Wang, Yian Wang, Katerina Fragkiadaki, Zackory Erickson, David Held, and Chuang Gan. Robogen: Towards unleashing infinite data for automated robot learning via generative simulation, 2023

2023

[31] [31]

Mobilemanibench: Simplifying model verification for mobile manipulation

Wenbo Wang, Fangyun Wei, QiXiu Li, Xi Chen, Yaobo Liang, Chang Xu, Jiaolong Yang, and Baining Guo. Mobilemanibench: Simplifying model verification for mobile manipulation. arXiv preprint arXiv:2602.05233, 2026

arXiv 2026

[32] [32]

Homie: Humanoid loco-manipulation with isomorphic exoskeleton cockpit.arXiv preprint arXiv:2502.13013, 2025

Qingwei Ben, Feiyu Jia, Jia Zeng, Junting Dong, Dahua Lin, and Jiangmiao Pang. Homie: Humanoid loco-manipulation with isomorphic exoskeleton cockpit.arXiv preprint arXiv:2502.13013, 2025

arXiv 2025

[33] [33]

Sonic: Supersizing motion tracking for natural humanoid whole-body control.arXiv preprint arXiv:2511.07820, 2025

Zhengyi Luo, Ye Yuan, Tingwu Wang, Chenran Li, Sirui Chen, Fernando Castañeda, Zi-Ang Cao, Jiefeng Li, David Minor, Qingwei Ben, Xingye Da, Runyu Ding, Cyrus Hogg, Lina Song, Edy Lim, Eugene Jeong, Tairan He, Haoru Xue, Wenli Xiao, Zi Wang, Simon Yuen, Jan Kautz, Yan Chang, Umar Iqbal, Linxi Fan, and Yuke Zhu. Sonic: Supersizing motion tracking for natura...

Pith/arXiv arXiv 2025

[34] [34]

curobov2: Dynamics- aware motion generation with depth-fused distance fields for high-dof robots, 2026

Balakumar Sundaralingam, Adithyavairavan Murali, and Stan Birchfield. curobov2: Dynamics- aware motion generation with depth-fused distance fields for high-dof robots, 2026

2026

[35] [35]

Key-grid: Unsupervised 3d keypoints detection using grid heatmap features, 2024

Chengkai Hou, Zhengrong Xue, Bingyang Zhou, Jinghan Ke, Lin Shao, and Huazhe Xu. Key-grid: Unsupervised 3d keypoints detection using grid heatmap features, 2024. URL https://arxiv.org/abs/2410.02237

arXiv 2024

[36] [36]

Skeleton merger: an unsupervised aligned keypoint detector

Ruoxi Shi, Zhengrong Xue, Yang You, and Cewu Lu. Skeleton merger: an unsupervised aligned keypoint detector. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 43–52, 2021

2021

[37] [37]

Where2explore: Few-shot affordance learning for unseen novel categories of articulated objects

Chuanruo Ning, Ruihai Wu, Haoran Lu, Kaichun Mo, and Hao Dong. Where2explore: Few-shot affordance learning for unseen novel categories of articulated objects. ArXiv, abs/2309.07473, 2023. URL https://api.semanticscholar.org/CorpusID: 261823492

arXiv 2023

[38] [38]

Genmanip: Llm-driven simulation for generalizable instruction-following manipulation

Ning Gao, Yilun Chen, Shuai Yang, Xinyi Chen, Yang Tian, Hao Li, Haifeng Huang, Hanqing Wang, Tai Wang, and Jiangmiao Pang. Genmanip: Llm-driven simulation for generalizable instruction-following manipulation. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 12187–12198, 2025. 13

2025

[39] [39]

Genie sim 3.0 : A high-fidelity comprehensive simulation platform for humanoid robot, 2026

Chenghao Yin, Da Huang, Di Yang, Jichao Wang, Nanshu Zhao, Chen Xu, Wenjun Sun, Linjie Hou, Zhijun Li, Junhui Wu, Zhaobo Liu, Zhen Xiao, Sheng Zhang, Lei Bao, Rui Feng, Zhenquan Pang, Jiayu Li, Qian Wang, and Maoqing Yao. Genie sim 3.0 : A high-fidelity comprehensive simulation platform for humanoid robot, 2026. URL https://arxiv.org/ abs/2601.02078

Pith/arXiv arXiv 2026

[40] [40]

Robotwin: Dual-arm robot benchmark with generative digital twins

Yao Mu, Tianxing Chen, Zanxin Chen, Shijia Peng, Zhiqian Lan, Zeyu Gao, Zhixuan Liang, Qiaojun Yu, Yude Zou, Mingkun Xu, Lunkai Lin, Zhiqiang Xie, Mingyu Ding, and Ping Luo. Robotwin: Dual-arm robot benchmark with generative digital twins. InProceedings of the Computer Vision and Pattern Recognition Conference (CVPR), pages 27649–27660, June 2025

2025

[41] [41]

Ultradexgrasp: Learning universal dexterous grasping for bimanual robots with synthetic data,

Sizhe Yang, Yiman Xie, Zhixuan Liang, Yang Tian, Jia Zeng, Dahua Lin, and Jiangmiao Pang. Ultradexgrasp: Learning universal dexterous grasping for bimanual robots with synthetic data,

[42] [42]

URLhttps://arxiv.org/abs/2603.05312

arXiv

[43] [43]

Unidexgrasp++: Improving dexterous grasping policy learning via geometry-aware curriculum and iterative generalist-specialist learning

Weikang Wan, Haoran Geng, Yun Liu, Zikang Shan, Yaodong Yang, Li Yi, and He Wang. Unidexgrasp++: Improving dexterous grasping policy learning via geometry-aware curriculum and iterative generalist-specialist learning. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 3891–3902, 2023

2023

[44] [44]

Dexgraspnet: A large-scale robotic dexterous grasp dataset for general objects based on simulation

Ruicheng Wang, Jialiang Zhang, Jiayi Chen, Yinzhen Xu, Puhao Li, Tengyu Liu, and He Wang. Dexgraspnet: A large-scale robotic dexterous grasp dataset for general objects based on simulation. In2023 IEEE International Conference on Robotics and Automation (ICRA), pages 11359–11366. IEEE, 2023. URLhttps://arxiv.org/abs/2210.02697

arXiv 2023

[45] [45]

Dexart: Benchmarking generalizable dexterous manipulation with articulated objects

Chen Bao, Helin Xu, Yuzhe Qin, and Xiaolong Wang. Dexart: Benchmarking generalizable dexterous manipulation with articulated objects. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21190–21200, 2023

2023

[46] [46]

Bodex: Scalable and efficient robotic dexterous grasp synthesis using bilevel optimization, 2025

Jiayi Chen, Yubin Ke, and He Wang. Bodex: Scalable and efficient robotic dexterous grasp synthesis using bilevel optimization, 2025. URLhttps://arxiv.org/abs/2412.16490

arXiv 2025

[47] [47]

Bidexgrasp: Coordinated bimanual dexterous grasps across object geometries and sizes, 2026

Mu Lin, Yi-Lin Wei, Jiaxuan Chen, Yuhao Lin, Shuoyu Chen, Jiangran Lyu, Jiayi Chen, Yansong Tang, He Wang, and Wei-Shi Zheng. Bidexgrasp: Coordinated bimanual dexterous grasps across object geometries and sizes, 2026. URL https://arxiv.org/abs/2604. 06589

2026

[48] [48]

Articubot: Learning universal articulated object ma- nipulation policy via large scale simulation.ArXiv, abs/2503.03045, 2025

Yufei Wang, Ziyu Wang, Mino Nakura, Pratik Bhowal, Chia-Liang Kuo, Yi-Ting Chen, Zackory Erickson, and David Held. Articubot: Learning universal articulated object ma- nipulation policy via large scale simulation.ArXiv, abs/2503.03045, 2025. URL https: //api.semanticscholar.org/CorpusID:276781877

arXiv 2025

[49] [49]

Qwen3.5-omni technical report, 2026

Qwen Team. Qwen3.5-omni technical report, 2026. URL https://arxiv.org/abs/2604. 15804

2026

[50] [50]

P3-sam: Native 3d part segmentation.arXiv preprint arXiv:2509.06784, 2025

Changfeng Ma, Yang Li, Xinhao Yan, Jiachen Xu, Yunhan Yang, Chunshi Wang, Zibo Zhao, Yanwen Guo, Zhuo Chen, and Chunchao Guo. P3-sam: Native 3d part segmentation.arXiv preprint arXiv:2509.06784, 2025

arXiv 2025

[51] [51]

X-part: High fidelity and structure coherent shape decomposition.arXiv preprint arXiv:2509.08643, 2025

Xinhao Yan, Jiachen Xu, Yang Li, Changfeng Ma, Yunhan Yang, Chunshi Wang, Zibo Zhao, Zeqiang Lai, Yunfei Zhao, Zhuo Chen, et al. X-part: High fidelity and structure coherent shape decomposition.arXiv preprint arXiv:2509.08643, 2025

arXiv 2025

[52] [52]

curobo: Parallelized collision-free minimum-jerk robot motion generation, 2023

Balakumar Sundaralingam, Siva Kumar Sastry Hari, Adam Fishman, Caelan Garrett, Karl Van Wyk, Valts Blukis, Alexander Millane, Helen Oleynikova, Ankur Handa, Fabio Ramos, Nathan Ratliff, and Dieter Fox. curobo: Parallelized collision-free minimum-jerk robot motion generation, 2023. URLhttps://arxiv.org/abs/2310.17274

arXiv 2023

[53] [53]

PA3FF: Learning part-aware dense 3d feature field for generalizable articulated object manipulation

Yue Chen, Muqing Jiang, Kaifeng Zheng, Jiaqi Liang, Chenrui Tie, Haoran Lu, Ruihai Wu, and Hao Dong. PA3FF: Learning part-aware dense 3d feature field for generalizable articulated object manipulation. InThe Fourteenth International Conference on Learning Representations, 2026. doi: 10.48550/arXiv.2602.14193. URL https://openreview.net/ forum?id=qXfRXfAHO...

work page doi:10.48550/arxiv.2602.14193 2026

[54] [54]

acl-main.485/

Ruihai Wu, Haoran Lu, Yiyan Wang, Yubo Wang, and Hao Dong. UniGarmentManip: A unified framework for category-level garment manipulation via dense visual correspondence. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 16340–16350, 2024. doi: 10.1109/CVPR52733.2024.01546. URL https: //openaccess.thecvf.com/c...

work page doi:10.1109/cvpr52733.2024.01546 2024

[55] [55]

Where2Explore: Few- shot affordance learning for unseen novel categories of articulated objects

Chuanruo Ning, Ruihai Wu, Haoran Lu, Kaichun Mo, and Hao Dong. Where2Explore: Few- shot affordance learning for unseen novel categories of articulated objects. InAdvances in Neural Information Processing Systems, 2023. URL https://openreview.net/forum? id=QLllDwizVd

2023

[56] [56]

ImageManip: Image-based robotic manipulation with affordance-guided next view selection.arXiv preprint arXiv:2310.09069, 2023

Xiaoqi Li, Yanzi Wang, Yan Shen, Iaroslav Ponomarenko, Haoran Lu, Qianxu Wang, Boshi An, Jiaming Liu, and Hao Dong. ImageManip: Image-based robotic manipulation with affordance-guided next view selection.arXiv preprint arXiv:2310.09069, 2023. doi: 10.48550/ arXiv.2310.09069. URLhttps://arxiv.org/abs/2310.09069

arXiv 2023

[57] [57]

SPACENUM: Revisiting Spatial Numerical Understanding in VLMs

Jianshu Zhang, Yijiang Li, Huifeixin Chen, Haoran Lu, Letian Xue, Bingyang Wang, and Han Liu. SPACENUM: Revisiting spatial numerical understanding in VLMs.arXiv preprint arXiv:2605.23898, 2026. doi: 10.48550/arXiv.2605.23898. URL https://arxiv.org/abs/ 2605.23898

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2605.23898 2026

[58] [58]

PROGRESSLM: Towards Progress Reasoning in Vision-Language Models

Jianshu Zhang, Chengxuan Qian, Haosen Sun, Haoran Lu, Dingcheng Wang, Letian Xue, and Han Liu. ProgressLM: Towards progress reasoning in vision-language models.arXiv preprint arXiv:2601.15224, 2026. doi: 10.48550/arXiv.2601.15224. URL https://arxiv. org/abs/2601.15224. ACL 2026 Main Conference Oral; ICLR 2026 Workshop on World Models

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2601.15224 2026

[59] [59]

arXiv:2510.01586 [cs.AI] doi:10.48550/arXiv.2510.01586

Zhenyu Pan, Yiting Zhang, Zhuo Liu, Yolo Yunlong Tang, Zeliang Zhang, Haozheng Luo, Yuwei Han, Jianshu Zhang, Dennis Wu, Hong-Yu Chen, Haoran Lu, Haoyang Fang, Manling Li, Chenliang Xu, Philip S. Yu, and Han Liu. AdvEvo-MARL: Shaping in- ternalized safety through adversarial co-evolution in multi-agent reinforcement learning. arXiv preprint arXiv:2510.015...

work page doi:10.48550/arxiv.2510.01586 2025

[60] [60]

Chang, Li Yi, Subarna Tripathi, Leonidas J

Kaichun Mo, Shilin Zhu, Angel X. Chang, Li Yi, Subarna Tripathi, Leonidas J. Guibas, and Hao Su. Partnet: A large-scale benchmark for fine-grained and hierarchical part-level 3d object understanding. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 909–918, 2019. URL https://openaccess.thecvf.com/content_ CVPR_201...

2019

[61] [61]

Partnext: A next-generation dataset for fine-grained and hierarchical 3d part understanding.arXiv preprint arXiv:2510.20155, 2025

Penghao Wang, Yiyang He, Xin Lv, Yukai Zhou, Lan Xu, Jingyi Yu, and Jiayuan Gu. Partnext: A next-generation dataset for fine-grained and hierarchical 3d part understanding.arXiv preprint arXiv:2510.20155, 2025

arXiv 2025

[62] [62]

Keypointnet: A large-scale 3d keypoint dataset aggregated from numerous human annotations.arXiv preprint arXiv:2002.12687, 2020

Yang You, Yujing Lou, Chengkun Li, Zhoujun Cheng, Liangwei Li, Lizhuang Ma, Cewu Lu, and Weiming Wang. Keypointnet: A large-scale 3d keypoint dataset aggregated from numerous human annotations.arXiv preprint arXiv:2002.12687, 2020

arXiv 2002

[63] [63]

Laso: Language-guided affordance segmentation on 3d object

Yuandong Liu, Yan Wei, Yue Jiang, et al. Laso: Language-guided affordance segmentation on 3d object. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024

2024

[64] [64]

Stephen James, Zicong Ma, David Rovick Arrojo, and Andrew J. Davison. Rlbench: The robot learning benchmark & learning environment.arXiv preprint arXiv:1909.12271, 2019. URLhttps://arxiv.org/abs/1909.12271

arXiv 1909

[65] [65]

Mimicgen: A data generation system for scalable robot learning using human demonstrations

Ajay Mandlekar, Soroush Nasiriany, Bowen Wen, Iretiayo Akinola, Yashraj Narang, Linxi Fan, Yuke Zhu, and Dieter Fox. Mimicgen: A data generation system for scalable robot learning using human demonstrations. InProceedings of The 7th Conference on Robot Learning, volume 229 ofProceedings of Machine Learning Research, pages 1820–1864. PMLR, 2023. URLhttps:/...

2023

[66] [66]

Robogen: Towards unleashing infinite data for automated robot learning via generative simulation

Yufei Wang, Zhou Xian, Feng Chen, Tsun-Hsuan Wang, Yian Wang, Katerina Fragkiadaki, Zackory Erickson, David Held, and Chuang Gan. Robogen: Towards unleashing infinite data for automated robot learning via generative simulation. InProceedings of the 41st International Conference on Machine Learning, volume 235 ofProceedings of Machine Learning Research, pa...

2024

[67] [67]

Gensim: Generating robotic simulation tasks via large language models

Lirui Wang, Yiyang Ling, Zhecheng Yuan, Mohit Shridhar, Chen Bao, Yuzhe Qin, Bailin Wang, Huazhe Xu, and Xiaolong Wang. Gensim: Generating robotic simulation tasks via large language models. InThe Twelfth International Conference on Learning Representations,

[68] [68]

URLhttps://openreview.net/forum?id=OI3RoHoWAN

[69] [69]

Meta-world: A benchmark and evaluation for multi-task and meta reinforcement learning.arXiv preprint arXiv:1910.10897, 2019

Tianhe Yu, Deirdre Quillen, Zhanpeng He, Ryan Julian, Avnish Narayan, Hayden Shiv- ely, Adithya Bellathur, Karol Hausman, Chelsea Finn, and Sergey Levine. Meta-world: A benchmark and evaluation for multi-task and meta reinforcement learning.arXiv preprint arXiv:1910.10897, 2019. URLhttps://arxiv.org/abs/1910.10897

arXiv 1910

[70] [70]

Isaac gym: High performance gpu-based physics simulation for robot learning

Viktor Makoviychuk, Lukasz Wawrzyniak, Yunrong Guo, Michelle Lu, Kier Storey, Miles Macklin, David Hoeller, Nikita Rudin, Arthur Allshire, Ankur Handa, and Gavriel State. Isaac gym: High performance gpu-based physics simulation for robot learning. InThirty-Fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2021. URLht...

Pith/arXiv arXiv 2021

[71] [71]

Learning dexterous in-hand manipulation.arXiv preprint arXiv:1808.00177, 2018

OpenAI, Marcin Andrychowicz, Bowen Baker, Maciek Chociej, Rafal Józefowicz, Bob McGrew, Jakub Pachocki, Arthur Petron, Matthias Plappert, Glenn Powell, Alex Ray, Jonas Schneider, Szymon Sidor, Josh Tobin, Peter Welinder, Lilian Weng, and Wojciech Zaremba. Learning dexterous in-hand manipulation.arXiv preprint arXiv:1808.00177, 2018. URL https://arxiv.org/...

Pith/arXiv arXiv 2018

[72] [72]

Dex-net 2.0: Deep learning to plan robust grasps with synthetic point clouds and analytic grasp metrics

Jeffrey Mahler, Jacky Liang, Sherdil Niyaz, Michael Laskey, Richard Doan, Xinyu Liu, Juan Aparicio Ojea, and Ken Goldberg. Dex-net 2.0: Deep learning to plan robust grasps with synthetic point clouds and analytic grasp metrics. InRobotics: Science and Systems, 2017. URLhttps://www.roboticsproceedings.org/rss13/p58.pdf

2017

[73] [73]

Acronym: A large-scale grasp dataset based on simulation

Clemens Eppner, Arsalan Mousavian, and Dieter Fox. Acronym: A large-scale grasp dataset based on simulation. In2021 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2021. URL https://research.nvidia.com/publication/2021-05_ acronym-large-scale-grasp-dataset-based-simulation

2021

[74] [74]

6-dof graspnet: Variational grasp generation for object manipulation

Arsalan Mousavian, Clemens Eppner, and Dieter Fox. 6-dof graspnet: Variational grasp generation for object manipulation. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 2901–2910, 2019. URL https://openaccess.thecvf. com/content_ICCV_2019/html/Mousavian_6-DOF_GraspNet_Variational_Grasp_ Generation_for_Object_Manipulation_I...

2019

[75] [75]

Contact-graspnet: Efficient 6-dof grasp generation in cluttered scenes.arXiv preprint arXiv:2103.14127, 2021

Martin Sundermeyer, Arsalan Mousavian, Rudolph Triebel, and Dieter Fox. Contact-graspnet: Efficient 6-dof grasp generation in cluttered scenes.arXiv preprint arXiv:2103.14127, 2021. URLhttps://arxiv.org/abs/2103.14127

arXiv 2021

[76] [76]

Bimanual grasp synthesis for dexterous robot hands.arXiv preprint arXiv:2411.15903, 2024

Yanming Shao and Chenxi Xiao. Bimanual grasp synthesis for dexterous robot hands.arXiv preprint arXiv:2411.15903, 2024. URLhttps://arxiv.org/abs/2411.15903

arXiv 2024

[77] [78]

URLhttps://arxiv.org/abs/2101.02692

arXiv

[78] [79]

Vat-mart: Learning visual action trajectory pro- posals for manipulating 3d articulated objects

Ruihai Wu, Yan Zhao, Kaichun Mo, Zizheng Guo, Yian Wang, Tianhao Wu, Qingnan Fan, Xuelin Chen, Leonidas Guibas, and Hao Dong. Vat-mart: Learning visual action trajectory pro- posals for manipulating 3d articulated objects. InThe Tenth International Conference on Learn- ing Representations, 2022. URLhttps://openreview.net/forum?id=iEx3PiooLy. 16

2022

[79] [80]

Gapartnet: Cross-category domain-generalizable object perception and manipulation via generalizable and actionable parts

Haoran Geng, Helin Xu, Chengyang Zhao, Chao Xu, Li Yi, Siyuan Huang, and He Wang. Gapartnet: Cross-category domain-generalizable object perception and manipulation via generalizable and actionable parts. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7081– 7091, 2023. URL https://openaccess.thecvf.com/content/CV...

2023

[80] [81]

O2o-afford: Annotation- free large-scale object-object affordance learning

Kaichun Mo, Yuzhe Qin, Fanbo Xiang, Hao Su, and Leonidas Guibas. O2o-afford: Annotation- free large-scale object-object affordance learning. InConference on Robot Learning, 2021. URLhttps://openreview.net/forum?id=EougVeukEH9

2021