Dexora: Open-source VLA for High-DoF Bimanual Dexterity

Dongyun Ge; Guocai Yao; Guoxuan Chi; Hang Zhao; Hao Dong; Hao Zhao; Hongyang Li; Huan-ang Gao; Huazhe Xu; Jianyu Chen

arxiv: 2605.18722 · v1 · pith:DVX6BQIHnew · submitted 2026-05-18 · 💻 cs.RO

Zongzheng Zhang, Jingrui Pang, Zhuo Yang, Kun Li, Minwen Liao, Saining Zhang, Guoxuan Chi, Jinbang Guo, Huan-ang Gao, Modi Shi, Dongyun Ge, Yao Mu, Jiayuan Gu, Rui Chen, Hao Dong, Huazhe Xu, Li Yi, Yixin Zhu, Hang Zhao, Pengwei Wang, Shanghang Zhang, Guocai Yao, Jianyu Chen, Hongyang Li, Hao Zhao

Pith reviewed 2026-05-20 09:39 UTC · model grok-4.3

classification 💻 cs.RO

keywords: vision-language-action, bimanual dexterity, robotic manipulation, dexterous hands, teleoperation, diffusion policy, dual-arm control, open-source VLA

The pith

Dexora shows that a quality-weighted VLA trained on matched synthetic and real teleoperation data can learn effective high-DoF bimanual dexterity.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Dexora, the first open-source vision-language-action model built for dual-arm dual-hand high-DoF manipulation. It collects large embodiment-matched datasets through a hybrid teleoperation pipeline that separates arm kinematics from finger motion and drives both a physical robot and its MuJoCo twin. Training uses an offline discriminator to assign clip-level weights that down-weight noisy demonstrations inside a diffusion-transformer policy. This produces 90 percent success on basic tasks and 66.7 percent average success on dexterous benchmarks, beating prior VLA baselines while also generalizing to new distributions and different embodiments. A sympathetic reader would care because most existing VLAs remain limited to low-DoF grippers or single arms, and high-DoF bimanual control is a necessary step toward robots that can perform complex, human-scale tasks.

Core claim

By combining a hybrid exoskeleton-and-vision teleoperation interface with a 100K-trajectory synthetic corpus and 10K real episodes, then training a diffusion-transformer policy under clip-level weights from an offline discriminator, Dexora creates an effective open-source VLA for high-DoF bimanual dexterity that outperforms competitive baselines on both basic and dexterous benchmarks, reaches 90 percent success on basic tasks, and exhibits robust out-of-distribution and cross-embodiment generalization.

What carries the argument

The data-quality-aware training recipe that uses an offline discriminator to supply clip-level weights for down-weighting low-quality teleoperation demonstrations inside diffusion-transformer policy training on combined synthetic and real data.
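
As a rough illustration of how clip-level weights could enter a noise-prediction (diffusion) loss, the sketch below scales each clip's mean squared error by its weight and normalizes by the total weight. It is a minimal sketch under assumptions: the function name `weighted_denoising_loss`, the normalization by the weight sum, and the toy numbers are hypothetical, and the paper's actual loss may differ.

```python
from typing import Sequence


def weighted_denoising_loss(
    pred_noise: Sequence[Sequence[float]],
    true_noise: Sequence[Sequence[float]],
    clip_weights: Sequence[float],
) -> float:
    """Weight-normalized mean of per-clip MSE between predicted and true noise.

    Hypothetical helper: one row of noise values per clip, one weight per clip.
    """
    if not (len(pred_noise) == len(true_noise) == len(clip_weights)):
        raise ValueError("inputs must have one entry per clip")
    total_weight = sum(clip_weights)
    if total_weight <= 0:
        raise ValueError("clip weights must sum to a positive value")

    weighted_sum = 0.0
    for pred, true, w in zip(pred_noise, true_noise, clip_weights):
        # Per-clip mean squared error of the noise prediction.
        mse = sum((p - t) ** 2 for p, t in zip(pred, true)) / len(pred)
        weighted_sum += w * mse
    return weighted_sum / total_weight


# Toy example (made-up numbers): the second clip has a large error.
pred = [[0.1, -0.2], [1.5, 1.0]]
true = [[0.0, 0.0], [0.0, 0.0]]
uniform = weighted_denoising_loss(pred, true, [1.0, 1.0])
down_weighted = weighted_denoising_loss(pred, true, [1.0, 0.1])
```

With a small weight on the second clip, its error contributes far less to the loss, which is the intended effect of down-weighting a noisy demonstration.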

If this is right

Reaches 90 percent success on basic manipulation tasks.
Achieves 66.7 percent average success on dexterous benchmarks versus 51.7 percent for prior VLA baselines.
Demonstrates robust generalization to out-of-distribution inputs and to different robot embodiments.
Ablations confirm that both the inclusion of real data and the discriminator weighting are necessary for high dexterity performance.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Open-sourcing the full system and datasets could let other groups extend the approach to additional high-DoF platforms without starting from scratch.
The hybrid teleoperation pipeline could be reused to collect higher-quality data for tasks requiring even finer finger coordination than the current benchmarks.
Further increases in the volume of real-world episodes might produce additional gains in generalization beyond the out-of-distribution cases already tested.

Load-bearing premise

The offline discriminator can reliably assign clip-level weights that reduce the effect of noisy teleoperation demonstrations enough for the policy to learn high-DoF bimanual control from the mixed datasets.

What would settle it

Training the same diffusion-transformer policy on the combined datasets without the discriminator weights and measuring whether average dexterous success falls to or below the reported 51.7 percent baseline level.

Figures

Figures reproduced from arXiv: 2605.18722 by Dongyun Ge, Guocai Yao, Guoxuan Chi, Hang Zhao, Hao Dong, Hao Zhao, Hongyang Li, Huan-ang Gao, Huazhe Xu, Jianyu Chen, Jiayuan Gu, Jinbang Guo, Jingrui Pang, Kun Li, Li Yi, Minwen Liao, Modi Shi, Pengwei Wang, Rui Chen, Saining Zhang, Shanghang Zhang, Yao Mu, Yixin Zhu, Zhuo Yang, Zongzheng Zhang.

**Figure 1.** Dexora overview. (a) Motivation: Three illustrative contrasts highlight the need for dual-arm, dual-hand dexterous VLA: piston insertion (requires two arms), book retrieval from a packed shelf (hands with fingers succeed where grippers fail), and bottle opening (12-DoF fingers with lateral swing outperform 6-DoF). (b) Dataset (§III-B): We pretrain on 100K simulated bimanual-hand trajectories and post-train…

**Figure 2.** Comparison of embodiment coverage. Prior works cover either single-arm or low-DoF dual-arm settings. Dexora is the first system positioned in the dual-arm, high-DoF dexterous region, while also generalizing across simpler embodiments without re-architecture.

**Figure 3.** Hardware and teleoperation system. (a) Hybrid teleoperation interface and 12-DoF XHAND. (b)-(c) The operator teleoperates the physical robot and its MuJoCo digital twin, so apple→plate demonstrations are collected in the real world and in simulation under the same interface, thereby reducing the sim-to-real gap.

**Figure 4.** Dataset demonstration. (a) Simulation objects subset: our simulator includes 297 objects across 30 categories. (b) Real-world objects (347 objects, 17 categories), covering both basic and dexterous use cases. (c) Per-family task distribution in simulation vs. real. The simulation data only includes basic tasks, while the real-world set shifts weight toward dexterity (20%). (d) Trajectory counts per family …

**Figure 5.** Dexora framework. (a) Data filtering: From the real-world dataset we pre-screen demonstrations by kinematic smoothness (low acceleration and jerk), then replay them for post-validation and keep the clips that complete the task without collisions, forming a high-quality subset. (b) Discriminator training: With the pretrained diffusion–transformer policy frozen, we compute a log-π proxy for each clip and tra…

**Figure 6.** Basic tasks suite. (a) Pick and Place (5 tasks). (b) Assemble/Disassemble (5 tasks). (c) Articulated Objects (2 tasks).

[Table I residue: basic tasks evaluation. Results are success rates (%) over 20 trials; gray columns indicate bimanual tasks. Column groups: Pick and Place, Assemble / Disassemble, Articulated Object, Avg. Task columns as extracted: Apple → plate, Bowl → bowl, Two eggs → box, Lift basket, Left block → right plate, Stack ring blocks, Grab squ…]

**Figure 7.** Dexterous manipulation sequences. (a) Use Pen: The left hand picks up the pen (#1), hands it to the right hand (#2); the right thumb depresses the tip (#3) and writes on paper (#4). (b) Cut Leek: The right hand grasps the knife (#1), the left hand stabilizes the leek (#2); the right hand slices (#3) and returns the knife to the table (#4). (c) Rough Dough: Both hands press the rolling pin simultaneously (#…

**Figure 9.** Cross-embodiment generalization. The Dexora policy transfers to (a) single-arm gripper, (b) dual-arm grippers, and (c) single-arm single hand, completing representative tasks like a three-step pepper handover.

[Plot residue from an adjacent bar chart: success rate (%) on apple→plate, stack ring blocks, Use pen, and Cut leek under Sim Only, Sim + 50% Real, and Sim + All Real. Bar values in extraction order: 75, 90, 100, 65, 80, 85, 0, 35, 65, 10, 60, 80.]
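
The Figure 5 caption mentions pre-screening demonstrations by kinematic smoothness (low acceleration and jerk). A minimal sketch of such a pre-screen is given below, assuming a single joint trajectory sampled at a fixed period and simple forward differences. The helper names, the sampling period, and the thresholds are hypothetical and are not taken from the paper.

```python
from typing import List, Sequence, Tuple


def finite_difference(values: Sequence[float], dt: float) -> List[float]:
    """First-order forward difference of a uniformly sampled signal."""
    return [(b - a) / dt for a, b in zip(values, values[1:])]


def peak_acceleration_and_jerk(
    positions: Sequence[float], dt: float
) -> Tuple[float, float]:
    """Return (max |acceleration|, max |jerk|) for one joint trajectory."""
    if len(positions) < 4:
        raise ValueError("need at least 4 samples to estimate jerk")
    velocity = finite_difference(positions, dt)
    acceleration = finite_difference(velocity, dt)
    jerk = finite_difference(acceleration, dt)
    return max(abs(a) for a in acceleration), max(abs(j) for j in jerk)


def is_smooth(
    positions: Sequence[float], dt: float, max_acc: float, max_jerk: float
) -> bool:
    """Hypothetical pre-screen: keep a clip only if both peaks are under threshold."""
    acc, jerk = peak_acceleration_and_jerk(positions, dt)
    return acc <= max_acc and jerk <= max_jerk


dt = 0.1  # assumed sampling period in seconds
smooth_clip = [0.5 * (i * dt) ** 2 for i in range(10)]  # constant acceleration
jerky_clip = [0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]  # one spike

keep_smooth = is_smooth(smooth_clip, dt, max_acc=2.0, max_jerk=5.0)
keep_jerky = is_smooth(jerky_clip, dt, max_acc=2.0, max_jerk=5.0)
```

In a real pipeline the check would run per joint across all arm and finger degrees of freedom, and the caption indicates that clips passing this stage are then replayed for post-validation.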

Original abstract

Vision-Language-Action (VLA) models have recently become a central direction in embodied AI, but current systems are restricted to either dual-gripper control or single-arm dexterous hand manipulation. While low-dimensional gripper control can often be handled with simpler methods, high-dimensional dexterous hand control benefits greatly from full end-to-end VLA learning. In this work, we introduce Dexora, the first open-source VLA system that natively targets dual-arm, dual-hand high-DoF manipulation. We design a hybrid teleoperation pipeline that decouples gross arm kinematics (captured with a custom exoskeleton backpack) from fine finger motion (markerless hand tracking via Apple Vision Pro), and that drives both a physical dual-arm dual-hand platform and an identical MuJoCo digital twin. Using that interface, we assemble a large training corpus: an embodiment-matched synthetic corpus (100K simulated trajectories, 6.5M frames) and a real-world dataset of 10K teleoperated episodes (2.92M frames). To mitigate noisy teleoperation demonstrations, we propose a data-quality-aware training recipe: an offline discriminator provides clip-level weights for diffusion-transformer policy training, down-weighting low-quality demonstrations. Empirically, Dexora outperforms competitive VLA baselines on both basic and dexterous benchmarks (e.g., average dexterous success 66.7% vs. 51.7%), attains 90% success on basic tasks, and shows robust out-of-distribution and cross-embodiment generalization. Ablations confirm the importance of real data and the discriminator for dexterity.
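
A quick arithmetic check on the corpus sizes quoted in the abstract gives a sense of episode length. The counts below are the rounded figures from the abstract, so the results are approximate; the variable names are illustrative only.

```python
# Corpus sizes as quoted in the abstract (rounded figures).
sim_trajectories, sim_frames = 100_000, 6_500_000
real_episodes, real_frames = 10_000, 2_920_000

# Average length of a simulated trajectory and of a real episode, in frames.
sim_frames_per_trajectory = sim_frames / sim_trajectories
real_frames_per_episode = real_frames / real_episodes

# Share of all frames that come from real teleoperation.
real_share_of_frames = real_frames / (sim_frames + real_frames)
```

On these figures a simulated trajectory averages about 65 frames and a real episode about 292 frames, so real data makes up roughly 31% of frames while being about 9% of episodes. The abstract does not state frame rates, so these counts cannot be converted to durations.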

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Dexora ships the first open VLA for native dual-arm dual-hand high-DoF control with a hybrid teleop pipeline and a discriminator weighting step, but the evidence that the weighting itself drives the gains is still thin.

The main thing to know is that this paper puts out the first open-source VLA built from the start for dual-arm dual-hand high-DoF manipulation, along with a sizable mixed dataset and a practical way to collect the data. That combination is useful on its own even before the numbers are examined closely.

The hybrid teleoperation setup separates gross arm motion (captured by an exoskeleton backpack) from fine finger motion (tracked with Vision Pro), then drives both the real dual-arm dual-hand platform and a matching MuJoCo twin. The authors collected 100K synthetic trajectories and 10K real teleoperated episodes, then trained a diffusion-transformer policy with clip-level weights from an offline discriminator meant to down-weight noisy demonstrations. The reported results are competitive: 90% success on basic tasks and 66.7% average on dexterous ones versus 51.7% for the baselines, plus some out-of-distribution and cross-embodiment generalization. The open release of code, data, and interfaces is the clearest contribution here.

The soft spots sit mostly around the discriminator claim. The abstract says ablations show the weighting matters, yet it gives no histograms, no qualitative examples of high- versus low-weight clips, and no correlation checks against human judgments of noise or trajectory quality. Without those, it remains possible that the performance edge comes from dataset scale, the teleoperation interface, or model capacity rather than the weighting recipe itself. Baseline details and task definitions are also light in the summary, which makes the comparisons harder to assess.

This paper is for groups already working on embodied VLAs or dexterous bimanual systems who need a concrete open starting point rather than another closed model. Readers who want to run or extend high-DoF manipulation experiments will get immediate value from the datasets and interfaces. It has enough scale and novelty to deserve a serious referee, even if the current version needs tighter evidence on the training method.

Referee Report

2 major / 2 minor

Summary. Dexora presents the first open-source VLA model for high-DoF bimanual dexterity with dual arms and dual hands. It introduces a hybrid teleoperation pipeline (exoskeleton for gross arm motion, Apple Vision Pro for finger tracking) that drives both a physical platform and MuJoCo twin, assembles a 100K-trajectory synthetic corpus plus 10K real teleoperated episodes, and trains a diffusion-transformer policy with an offline discriminator that supplies clip-level weights to down-weight noisy demonstrations. The paper reports 90% success on basic tasks, 66.7% average dexterous success (vs. 51.7% for competitive VLA baselines), plus robust OOD and cross-embodiment generalization, with ablations attributing gains to real data and the discriminator.

Significance. If the empirical claims hold under rigorous controls, this would be a notable contribution as the first open-source VLA explicitly targeting high-dimensional dual-arm/dual-hand control. The hybrid data recipe and quality-aware weighting address a practical bottleneck in scaling dexterous policies; reproducible open-source release of model, data, and interface would further amplify impact on embodied AI research.

major comments (2)

The central performance claim (66.7% dexterous success vs. 51.7% baselines, plus OOD/cross-embodiment gains) rests on the offline discriminator producing clip-level weights that meaningfully down-weight noisy teleoperation episodes. The abstract states that ablations confirm the discriminator's importance, yet the manuscript provides no direct validation of this mechanism: no weight histograms, no qualitative examples of high- versus low-weight clips, and no reported correlation between assigned weights and independent quality metrics such as trajectory smoothness or human-rated task completion. Without these checks it remains possible that reported gains derive from dataset scale, the hybrid interface, or model capacity rather than the weighting itself.
Baseline comparisons and statistical reporting lack necessary detail for reproducibility and fairness. The abstract cites concrete success-rate deltas but supplies no information on how the competitive VLA baselines were implemented (e.g., exact architectures, training hyperparameters, or whether they received the same mixed synthetic+real corpus), no statistical tests or confidence intervals on the reported percentages, and no precise definitions of the basic versus dexterous task suites or success criteria.
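
To make the request for confidence intervals concrete, the sketch below computes a Wilson score interval for a single success rate. The Figure 6 caption residue states that basic-task results are success rates over 20 trials; the 18-of-20 count used here is a hypothetical single-task example, not a number from the paper.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 is about 95%)."""
    if trials <= 0 or not 0 <= successes <= trials:
        raise ValueError("need trials > 0 and 0 <= successes <= trials")
    p_hat = successes / trials
    denom = 1.0 + z * z / trials
    center = (p_hat + z * z / (2 * trials)) / denom
    half_width = (
        z * math.sqrt(p_hat * (1 - p_hat) / trials + z * z / (4 * trials * trials))
    ) / denom
    return center - half_width, center + half_width


# Hypothetical single task: 18 successes out of 20 trials.
low, high = wilson_interval(18, 20)
```

For this example the interval runs from roughly 0.70 to 0.97, which shows how wide the uncertainty on a 20-trial success rate can be and why per-task differences of a few trials are hard to interpret without intervals.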

minor comments (2)

Notation for the discriminator output (clip-level weights) should be introduced with an explicit equation or pseudocode block so readers can trace how weights are applied inside the diffusion-transformer loss.
Figure captions for the teleoperation interface and dataset examples would benefit from additional labels indicating which components are synthetic versus real and which arm/hand DoFs are being controlled.
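
One form the requested equation could take is sketched below, assuming a standard noise-prediction diffusion objective. All symbols are assumptions introduced here for illustration: $w_c$ is the discriminator weight of clip $c$, $\epsilon_\theta$ is the denoising network, $a^{(c)}_t$ is the noised action chunk at diffusion step $t$, and $o^{(c)}$ and $\ell^{(c)}$ are the clip's observations and language instruction. The paper's actual notation and normalization may differ.

```latex
\mathcal{L}(\theta)
  = \frac{1}{\sum_{c} w_c} \sum_{c} w_c \,
    \mathbb{E}_{t,\,\epsilon}\!\left[
      \left\lVert \epsilon
        - \epsilon_\theta\!\left(a^{(c)}_t,\; t,\; o^{(c)},\; \ell^{(c)}\right)
      \right\rVert_2^2
    \right]
```

Setting every $w_c = 1$ recovers the unweighted objective, which is the natural ablation baseline for the weighting.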

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed review. We address each major comment below and outline the revisions we will incorporate to strengthen the manuscript.

Referee: The central performance claim (66.7% dexterous success vs. 51.7% baselines, plus OOD/cross-embodiment gains) rests on the offline discriminator producing clip-level weights that meaningfully down-weight noisy teleoperation episodes. The abstract states that ablations confirm the discriminator's importance, yet the manuscript provides no direct validation of this mechanism: no weight histograms, no qualitative examples of high- versus low-weight clips, and no reported correlation between assigned weights and independent quality metrics such as trajectory smoothness or human-rated task completion. Without these checks it remains possible that reported gains derive from dataset scale, the hybrid interface, or model capacity rather than the weighting itself.

Authors: We agree that direct evidence for the discriminator's weighting mechanism would strengthen the claims. Our existing ablations show performance drops when the weighting is removed, but we acknowledge the absence of supporting visualizations and correlations. In the revised manuscript we will add: (i) histograms of clip-level weights across the real dataset, (ii) qualitative examples of high- versus low-weight trajectories with corresponding smoothness metrics, and (iii) a correlation analysis between assigned weights and independent quality indicators (trajectory jerk and human-rated task completion on a held-out subset). These additions will help isolate the contribution of the weighting from dataset scale and model capacity.
Revision: yes
Referee: Baseline comparisons and statistical reporting lack necessary detail for reproducibility and fairness. The abstract cites concrete success-rate deltas but supplies no information on how the competitive VLA baselines were implemented (e.g., exact architectures, training hyperparameters, or whether they received the same mixed synthetic+real corpus), no statistical tests or confidence intervals on the reported percentages, and no precise definitions of the basic versus dexterous task suites or success criteria.

Authors: We accept that additional implementation and statistical details are required for reproducibility. In the revised version we will expand the experimental section to include: (1) exact architectures, training hyperparameters, and data corpus details for each baseline (confirming use of the same mixed synthetic+real data where applicable), (2) precise definitions of the basic and dexterous task suites together with success criteria, and (3) statistical significance tests with 95% confidence intervals computed over multiple random seeds for all reported success rates. These changes will ensure fair and reproducible comparisons.
Revision: yes
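
The correlation analysis promised in the first response could be as simple as a rank correlation between clip weights and a per-clip jerk statistic. The sketch below implements Spearman's rank correlation with the standard library only; the weights and jerk values are made-up numbers for illustration, and the function names are hypothetical.

```python
import math
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # average of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Made-up example: higher weight should go with lower jerk.
weights = [0.9, 0.8, 0.6, 0.3, 0.1]
mean_jerk = [1.0, 1.5, 2.0, 4.0, 9.0]
rho = spearman(weights, mean_jerk)
```

A strongly negative value would support the claim that the discriminator down-weights jerky clips; a value near zero would suggest the weights track something other than smoothness.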

Circularity Check

0 steps flagged

No circularity: empirical performance claims rest on experimental training and evaluation, not on any derivation that reduces to inputs by construction.

full rationale

The paper describes an empirical pipeline: hybrid teleoperation data collection (synthetic 100K trajectories + real 10K episodes), an offline discriminator that assigns clip-level weights, and training of a diffusion-transformer policy whose success rates (66.7% dexterous, 90% basic) are measured on held-out benchmarks. No equations, uniqueness theorems, or first-principles derivations are invoked whose outputs are definitionally equivalent to the fitted weights or the training corpus itself. The discriminator weighting is a modeling choice whose effectiveness is asserted via ablations, but the reported numbers are direct experimental outcomes rather than predictions forced by the paper's own definitions or self-citations. The derivation chain is therefore self-contained as standard supervised learning on collected data.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The central claim rests on the effectiveness of the hybrid teleoperation interface and the discriminator weighting scheme. No new physical entities are postulated. The main unstated premises are that the collected demonstrations contain sufficient signal for high-DoF learning once low-quality clips are down-weighted and that simulation-to-real transfer works for the reported tasks.

free parameters (1)

clip-level discriminator weights
Learned or tuned weights that scale the contribution of each demonstration segment during diffusion policy training.

axioms (1)

domain assumption: Hybrid exoskeleton-plus-markerless tracking produces demonstrations whose quality distribution can be meaningfully scored by an offline discriminator
The data-quality-aware training recipe depends on this premise to mitigate noise in the 10K real episodes.

Reference graph

Works this paper leans on

44 extracted references · 44 canonical work pages · 10 internal anchors

[1]

Rt-2: Vision-language-action models transfer web knowledge to robotic control,

B. Zitkovich, T. Yu, S. Xu, P. Xu, T. Xiao, F. Xia, J. Wu, P. Wohlhart, S. Welker, A. Wahid,et al., “Rt-2: Vision-language-action models transfer web knowledge to robotic control,” inCoRL, PMLR, 2023

work page 2023
[2]

OpenVLA: An Open-Source Vision-Language-Action Model

M. J. Kim, K. Pertsch, S. Karamcheti, T. Xiao, A. Balakrishna, S. Nair, R. Rafailov, E. Foster, G. Lam, P. Sanketi,et al., “Open- vla: An open-source vision-language-action model,”arXiv preprint arXiv:2406.09246, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[3]

$\pi_0$: A Vision-Language-Action Flow Model for General Robot Control

K. Black, N. Brown, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, L. Groom, K. Hausman, B. Ichter,et al., “π 0: A vision- language-action flow model for general robot control,”arXiv preprint arXiv:2410.24164, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[4]

$\pi_{0.5}$: a Vision-Language-Action Model with Open-World Generalization

P. Intelligence, K. Black, N. Brown, J. Darpinian, K. Dhabalia, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai,et al., “π 0.5: a vision-language-action model with open-world generalization,”arXiv preprint arXiv:2504.16054, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[5]

RDT-1B: a Diffusion Foundation Model for Bimanual Manipulation

S. Liu, L. Wu, B. Li, H. Tan, H. Chen, Z. Wang, K. Xu, H. Su, and J. Zhu, “Rdt-1b: a diffusion foundation model for bimanual manipulation,”arXiv preprint arXiv:2410.07864, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[6]

GR-3 Technical Report

C. Cheang, S. Chen, Z. Cui, Y . Hu, L. Huang, T. Kong, H. Li, Y . Li, Y . Liu, X. Ma,et al., “Gr-3 technical report,”arXiv preprint arXiv:2507.15493, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[7]

GR00T N1: An Open Foundation Model for Generalist Humanoid Robots

J. Bjorck, F. Casta ˜neda, N. Cherniadev, X. Da, R. Ding, L. Fan, Y . Fang, D. Fox, F. Hu, S. Huang,et al., “Gr00t n1: An open foundation model for generalist humanoid robots,”arXiv preprint arXiv:2503.14734, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[8]

Learning fine-grained bimanual manipulation with low-cost hardware,

T. Z. Zhao, V . Kumar, S. Levine, and C. Finn, “Learning fine-grained bimanual manipulation with low-cost hardware,”RSS, 2023

work page 2023
[9]

Aloha 2: An enhanced low-cost hardware for bimanual teleoperation,

A. . Team, “Aloha 2: An enhanced low-cost hardware for bimanual teleoperation,”arXiv preprint arXiv:2405.02292, 2024

work page arXiv 2024
[10]

Open teach: A versatile teleoperation system for robotic manipulation,

A. Iyer, Z. Peng, Y . Dai, I. Guzey, S. Haldar, S. Chintala, and L. Pinto, “Open teach: A versatile teleoperation system for robotic manipulation,”CoRL, 2024

work page 2024
[11]

Bunny-visionpro: Real-time bimanual dexterous teleoperation for imitation learning,

R. Ding, Y . Qin, J. Zhu, C. Jia, S. Yang, R. Yang, X. Qi, and X. Wang, “Bunny-visionpro: Real-time bimanual dexterous teleoperation for imitation learning,”arXiv preprint arXiv:2407.03162, 2024

work page arXiv 2024
[12]

Anyteleop: A general vision-based dexterous robot arm- hand teleoperation system,

Y . Qin, W. Yang, B. Huang, K. Van Wyk, H. Su, X. Wang, Y .-W. Chao, and D. Fox, “Anyteleop: A general vision-based dexterous robot arm- hand teleoperation system,”RSS, 2023

work page 2023
[13]

Vision-based teleoperation of shadow dexterous hand using end-to-end deep neural network,

S. Li, X. Ma, H. Liang, M. G ¨orner, P. Ruppel, B. Fang, F. Sun, and J. Zhang, “Vision-based teleoperation of shadow dexterous hand using end-to-end deep neural network,”ICRA, 2019

work page 2019
[14]

Airexo: Low-cost exoskeletons for learning whole-arm manipulation in the wild,

H. Fang, H.-S. Fang, Y . Wang, J. Ren, J. Chen, R. Zhang, W. Wang, and C. Lu, “Airexo: Low-cost exoskeletons for learning whole-arm manipulation in the wild,”ICRA, 2024

work page 2024
[15]

Dexumi: Using human hand as the universal manipulation interface for dexterous manipulation,

M. Xu, H. Zhang, Y . Hou, Z. Xu, L. Fan, M. Veloso, and S. Song, “Dexumi: Using human hand as the universal manipulation interface for dexterous manipulation,”CoRL, 2025

work page 2025
[16]

Spark-remote: A cost-effective sys- tem for remote bimanual robot teleoperation,

A. Imdieke and K. Desingh, “Spark-remote: A cost-effective sys- tem for remote bimanual robot teleoperation,”arXiv preprint arXiv:2504.05488, 2025

work page arXiv 2025
[17]

Gello: A general, low- cost, and intuitive teleoperation framework for robot manipulators,

P. Wu, Y . Shentu, Z. Yi, X. Lin, and P. Abbeel, “Gello: A general, low- cost, and intuitive teleoperation framework for robot manipulators,” IROS, 2024

work page 2024
[18]

How to train your robots? the impact of demonstration modality on imitation learning,

H. Li, Y . Cui, and D. Sadigh, “How to train your robots? the impact of demonstration modality on imitation learning,”arXiv preprint arXiv:2503.07017, 2025

work page arXiv 2025
[19]

Language Conditioned Multi-Finger Dexterous Manipulation Enabled by Physical Compliance and Switching of Controllers

C. Pan, K. Junge, and J. Hughes, “Vision-language-action model and diffusion policy switching enables dexterous control of an anthropo- morphic hand,”arXiv preprint arXiv:2410.14022, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[20]

Dexgraspnet 2.0: Learning generative dexterous grasping in large-scale synthetic cluttered scenes,

J. Zhang, H. Liu, D. Li, X. Yu, H. Geng, Y . Ding, J. Chen, and H. Wang, “Dexgraspnet 2.0: Learning generative dexterous grasping in large-scale synthetic cluttered scenes,” in8th CoRL, 2024

work page 2024
[21]

Dex1b: Learning with 1b demonstrations for dexterous manipulation,

J. Ye, K. Wang, C. Yuan, R. Yang, Y . Li, J. Zhu, Y . Qin, X. Zou, and X. Wang, “Dex1b: Learning with 1b demonstrations for dexterous manipulation,”arXiv preprint arXiv:2506.17198, 2025

work page arXiv 2025
[22]

G-hop: Generative hand- object prior for interaction reconstruction and grasp synthesis,

Y . Ye, A. Gupta, K. Kitani, and S. Tulsiani, “G-hop: Generative hand- object prior for interaction reconstruction and grasp synthesis,” in CVPR, pp. 1911–1920, 2024

work page 1911
[23]

Dexgrasp anything: Towards universal robotic dexterous grasping with physics awareness,

Y . Zhong, Q. Jiang, J. Yu, and Y . Ma, “Dexgrasp anything: Towards universal robotic dexterous grasping with physics awareness,” in CVPR, pp. 22584–22594, 2025

work page 2025
[24]

Unidexgrasp: Universal robotic dexterous grasping via learning diverse proposal generation and goal-conditioned policy,

Y . Xu, W. Wan, J. Zhang, H. Liu, Z. Shan, H. Shen, R. Wang, H. Geng, Y . Weng, J. Chen,et al., “Unidexgrasp: Universal robotic dexterous grasping via learning diverse proposal generation and goal-conditioned policy,” inCVPR, pp. 4737–4746, 2023

work page 2023
[25]

Semgrasp: Semantic grasp generation via language aligned discretization,

K. Li, J. Wang, L. Yang, C. Lu, and B. Dai, “Semgrasp: Semantic grasp generation via language aligned discretization,” inECCV, 2024

work page 2024
[26]

Realdex: Towards human-like grasping for robotic dexterous hand,

Y . Liu, Y . Yang, Y . Wang, X. Wu, J. Wang, Y . Yao, S. Schwertfeger, S. Yang, W. Wang, J. Yu,et al., “Realdex: Towards human-like grasping for robotic dexterous hand,”arXiv:2402.13853, 2024

work page arXiv 2024
[27]

Dexonomy: Synthesiz- ing all dexterous grasp types in a grasp taxonomy,

J. Chen, Y . Ke, L. Peng, and H. Wang, “Dexonomy: Synthesiz- ing all dexterous grasp types in a grasp taxonomy,”arXiv preprint arXiv:2504.18829, 2025

work page arXiv 2025
[28]

Robustdex- grasp: Robust dexterous grasping of general objects,

H. Zhang, Z. Wu, L. Huang, S. Christen, and J. Song, “Robustdex- grasp: Robust dexterous grasping of general objects,”arXiv preprint arXiv:2504.05287, 2025

work page arXiv 2025
[29]

Maniptrans: Efficient dexterous bimanual manipulation transfer via residual learning,

K. Li, P. Li, T. Liu, Y . Li, and S. Huang, “Maniptrans: Efficient dexterous bimanual manipulation transfer via residual learning,” in CVPR, pp. 6991–7003, 2025

work page 2025
[30]

Dexmimicgen: Automated data generation for biman- ual dexterous manipulation via imitation learning,

Z. Jiang, Y . Xie, K. Lin, Z. Xu, W. Wan, A. Mandlekar, L. Fan, and Y . Zhu, “Dexmimicgen: Automated data generation for biman- ual dexterous manipulation via imitation learning,”arXiv preprint arXiv:2410.24185, 2024

work page arXiv 2024
[31]

Egovla: Learning vision-language-action models from egocentric human videos.arXiv preprint arXiv:2507.12440, 2025

R. Yang, Q. Yu, Y . Wu, R. Yan, B. Li, A.-C. Cheng, X. Zou, Y . Fang, H. Yin, S. Liu,et al., “Egovla: Learning vision-language-action models from egocentric human videos,”arXiv:2507.12440, 2025

work page arXiv 2025
[32]

Ta-vla: Elucidating the design space of torque-aware vision- language-action models,

Z. Zhang, H. Xu, Z. Yang, C. Yue, Z. Lin, H.-a. Gao, Z. Wang, and H. Zhao, “Ta-vla: Elucidating the design space of torque-aware vision- language-action models,”arXiv preprint arXiv:2509.07962, 2025

work page arXiv 2025
[33]

Robochemist: Long-horizon and safety-compliant robotic chemical experimentation,

Z. Zhang, C. Yue, H. Xu, M. Liao, X. Qi, H.-a. Gao, Z. Wang, and H. Zhao, “Robochemist: Long-horizon and safety-compliant robotic chemical experimentation,”arXiv preprint arXiv:2509.08820, 2025

work page arXiv 2025
[34]

GraspVLA: a Grasping Foundation Model Pre-trained on Billion-scale Synthetic Action Data

S. Deng, M. Yan, S. Wei, H. Ma, Y . Yang, J. Chen, Z. Zhang, T. Yang, X. Zhang, H. Cui,et al., “Graspvla: a grasping foundation model pre-trained on billion-scale synthetic action data,”arXiv preprint arXiv:2505.03233, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[35]

AgiBot World Colosseo: A Large-scale Manipulation Platform for Scalable and Intelligent Embodied Systems

Q. Bu, J. Cai, L. Chen, X. Cui, Y . Ding, S. Feng, S. Gao, X. He, X. Hu, X. Huang,et al., “Agibot world colosseo: A large-scale manipulation platform for scalable and intelligent embodied systems,”arXiv preprint arXiv:2503.06669, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[36]

Dexgraspvla: A vision-language-action framework towards general dexterous grasping,

Y . Zhong, X. Huang, R. Li, C. Zhang, Y . Liang, Y . Yang, and Y . Chen, “Dexgraspvla: A vision-language-action framework towards general dexterous grasping,”arXiv preprint arXiv:2502.20900, 2025

work page arXiv 2025
[37]

Being-h0: Vision-language-action pretraining from large-scale human videos,

H. Luo, Y . Feng, W. Zhang, S. Zheng, Y . Wang, H. Yuan, J. Liu, C. Xu, Q. Jin, and Z. Lu, “Being-h0: Vision-language-action pretraining from large-scale human videos,”arXiv preprint arXiv:2507.15597, 2025

work page arXiv 2025
[38]

Dreamgen: Unlocking generalization in robot learning through neural trajectories

J. Jang, S. Ye, Z. Lin, J. Xiang, J. Bjorck, Y. Fang, F. Hu, S. Huang, K. Kundalia, Y.-C. Lin, et al., “Dreamgen: Unlocking generalization in robot learning through neural trajectories,” pp. arXiv–2505, 2025
[39]

Qwen2.5-VL Technical Report

S. Bai, K. Chen, X. Liu, J. Wang, W. Ge, S. Song, K. Dang, P. Wang, S. Wang, J. Tang, et al., “Qwen2.5-VL technical report,” arXiv preprint arXiv:2502.13923, 2025
[40]

Objaverse-xl: A universe of 10m+ 3d objects

M. Deitke, R. Liu, M. Wallingford, H. Ngo, O. Michel, A. Kusupati, A. Fan, C. Laforte, V. Voleti, S. Y. Gadre, et al., “Objaverse-xl: A universe of 10m+ 3d objects,” NeurIPS, vol. 36, 2023
[41]

Exploring the limits of transfer learning with a unified text-to-text transformer

C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu, “Exploring the limits of transfer learning with a unified text-to-text transformer,” JMLR, vol. 21, pp. 1–67, 2020
[42]

Sigmoid loss for language image pre-training

X. Zhai, B. Mustafa, A. Kolesnikov, and L. Beyer, “Sigmoid loss for language image pre-training,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 11975–11986, 2023
[43]

Discriminator-weighted offline imitation learning from suboptimal demonstrations

H. Xu, X. Zhan, H. Yin, and H. Qin, “Discriminator-weighted offline imitation learning from suboptimal demonstrations,” in International Conference on Machine Learning, pp. 24725–24742, PMLR, 2022
[44]

Diffusion policy: Visuomotor policy learning via action diffusion

C. Chi, Z. Xu, S. Feng, E. Cousineau, Y. Du, B. Burchfiel, R. Tedrake, and S. Song, “Diffusion policy: Visuomotor policy learning via action diffusion,” IJRR, 2023
