Dexora: Open-source VLA for High-DoF Bimanual Dexterity
Pith reviewed 2026-05-20 09:39 UTC · model grok-4.3
The pith
Dexora shows that a quality-weighted VLA trained on matched synthetic and real teleoperation data can learn effective high-DoF bimanual dexterity.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By combining a hybrid exoskeleton-and-vision teleoperation interface with a 100K-trajectory synthetic corpus and 10K real episodes, then training a diffusion-transformer policy under clip-level weights from an offline discriminator, Dexora creates an effective open-source VLA for high-DoF bimanual dexterity that outperforms competitive baselines on both basic and dexterous benchmarks, reaches 90 percent success on basic tasks, and exhibits robust out-of-distribution and cross-embodiment generalization.
What carries the argument
The data-quality-aware training recipe that uses an offline discriminator to supply clip-level weights for down-weighting low-quality teleoperation demonstrations inside diffusion-transformer policy training on combined synthetic and real data.
If this is right
- Reaches 90 percent success on basic manipulation tasks.
- Achieves 66.7 percent average success on dexterous benchmarks versus 51.7 percent for prior VLA baselines.
- Demonstrates robust generalization to out-of-distribution inputs and to different robot embodiments.
- Ablations confirm that both the inclusion of real data and the discriminator weighting are necessary for high dexterity performance.
Where Pith is reading between the lines
- Open-sourcing the full system and datasets could let other groups extend the approach to additional high-DoF platforms without starting from scratch.
- The hybrid teleoperation pipeline could be reused to collect higher-quality data for tasks requiring even finer finger coordination than the current benchmarks.
- Further increases in the volume of real-world episodes might produce additional gains in generalization beyond the out-of-distribution cases already tested.
Load-bearing premise
The offline discriminator can reliably assign clip-level weights that reduce the effect of noisy teleoperation demonstrations enough for the policy to learn high-DoF bimanual control from the mixed datasets.
What would settle it
Training the same diffusion-transformer policy on the combined datasets without the discriminator weights and measuring whether average dexterous success falls to or below the reported 51.7 percent baseline level.
Figures
read the original abstract
Vision-Language-Action (VLA) models have recently become a central direction in embodied AI, but current systems are restricted to either dual-gripper control or single-arm dexterous hand manipulation. While low-dimensional gripper control can often be handled with simpler methods, high-dimensional dexterous hand control benefits greatly from full end-to-end VLA learning. In this work, we introduce Dexora, the first open-source VLA system that natively targets dual-arm, dual-hand high-DoF manipulation. We design a hybrid teleoperation pipeline that decouples gross arm kinematics (captured with a custom exoskeleton backpack) from fine finger motion (markerless hand tracking via Apple Vision Pro), and that drives both a physical dual-arm dual-hand platform and an identical MuJoCo digital twin. Using that interface, we assemble a large training corpus: an embodiment-matched synthetic corpus (100K simulated trajectories, 6.5M frames) and a real-world dataset of 10K teleoperated episodes (2.92M frames). To mitigate noisy teleoperation demonstrations, we propose a data-quality-aware training recipe: an offline discriminator provides clip-level weights for diffusion-transformer policy training, down-weighting low-quality demonstrations. Empirically, Dexora outperforms competitive VLA baselines on both basic and dexterous benchmarks (e.g., average dexterous success 66.7% vs. 51.7%), attains 90% success on basic tasks, and shows robust out-of-distribution and cross-embodiment generalization. Ablations confirm the importance of real data and the discriminator for dexterity.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. Dexora presents the first open-source VLA model for high-DoF bimanual dexterity with dual arms and dual hands. It introduces a hybrid teleoperation pipeline (exoskeleton for gross arm motion, Apple Vision Pro for finger tracking) that drives both a physical platform and MuJoCo twin, assembles a 100K-trajectory synthetic corpus plus 10K real teleoperated episodes, and trains a diffusion-transformer policy with an offline discriminator that supplies clip-level weights to down-weight noisy demonstrations. The paper reports 90% success on basic tasks, 66.7% average dexterous success (vs. 51.7% for competitive VLA baselines), plus robust OOD and cross-embodiment generalization, with ablations attributing gains to real data and the discriminator.
Significance. If the empirical claims hold under rigorous controls, this would be a notable contribution as the first open-source VLA explicitly targeting high-dimensional dual-arm/dual-hand control. The hybrid data recipe and quality-aware weighting address a practical bottleneck in scaling dexterous policies; reproducible open-source release of model, data, and interface would further amplify impact on embodied AI research.
major comments (2)
- The central performance claim (66.7% dexterous success vs. 51.7% baselines, plus OOD/cross-embodiment gains) rests on the offline discriminator producing clip-level weights that meaningfully down-weight noisy teleoperation episodes. The abstract states that ablations confirm the discriminator's importance, yet the manuscript provides no direct validation of this mechanism: no weight histograms, no qualitative examples of high- versus low-weight clips, and no reported correlation between assigned weights and independent quality metrics such as trajectory smoothness or human-rated task completion. Without these checks it remains possible that reported gains derive from dataset scale, the hybrid interface, or model capacity rather than the weighting itself.
- Baseline comparisons and statistical reporting lack necessary detail for reproducibility and fairness. The abstract cites concrete success-rate deltas but supplies no information on how the competitive VLA baselines were implemented (e.g., exact architectures, training hyperparameters, or whether they received the same mixed synthetic+real corpus), no statistical tests or confidence intervals on the reported percentages, and no precise definitions of the basic versus dexterous task suites or success criteria.
minor comments (2)
- Notation for the discriminator output (clip-level weights) should be introduced with an explicit equation or pseudocode block so readers can trace how weights are applied inside the diffusion-transformer loss.
- Figure captions for the teleoperation interface and dataset examples would benefit from additional labels indicating which components are synthetic versus real and which arm/hand DoFs are being controlled.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed review. We address each major comment below and outline the revisions we will incorporate to strengthen the manuscript.
read point-by-point responses
-
Referee: The central performance claim (66.7% dexterous success vs. 51.7% baselines, plus OOD/cross-embodiment gains) rests on the offline discriminator producing clip-level weights that meaningfully down-weight noisy teleoperation episodes. The abstract states that ablations confirm the discriminator's importance, yet the manuscript provides no direct validation of this mechanism: no weight histograms, no qualitative examples of high- versus low-weight clips, and no reported correlation between assigned weights and independent quality metrics such as trajectory smoothness or human-rated task completion. Without these checks it remains possible that reported gains derive from dataset scale, the hybrid interface, or model capacity rather than the weighting itself.
Authors: We agree that direct evidence for the discriminator's weighting mechanism would strengthen the claims. Our existing ablations show performance drops when the weighting is removed, but we acknowledge the absence of supporting visualizations and correlations. In the revised manuscript we will add: (i) histograms of clip-level weights across the real dataset, (ii) qualitative examples of high- versus low-weight trajectories with corresponding smoothness metrics, and (iii) a correlation analysis between assigned weights and independent quality indicators (trajectory jerk and human-rated task completion on a held-out subset). These additions will help isolate the contribution of the weighting from dataset scale and model capacity. revision: yes
-
Referee: Baseline comparisons and statistical reporting lack necessary detail for reproducibility and fairness. The abstract cites concrete success-rate deltas but supplies no information on how the competitive VLA baselines were implemented (e.g., exact architectures, training hyperparameters, or whether they received the same mixed synthetic+real corpus), no statistical tests or confidence intervals on the reported percentages, and no precise definitions of the basic versus dexterous task suites or success criteria.
Authors: We accept that additional implementation and statistical details are required for reproducibility. In the revised version we will expand the experimental section to include: (1) exact architectures, training hyperparameters, and data corpus details for each baseline (confirming use of the same mixed synthetic+real data where applicable), (2) precise definitions of the basic and dexterous task suites together with success criteria, and (3) statistical significance tests with 95% confidence intervals computed over multiple random seeds for all reported success rates. These changes will ensure fair and reproducible comparisons. revision: yes
Circularity Check
No circularity: empirical performance claims rest on experimental training and evaluation, not on any derivation that reduces to inputs by construction.
full rationale
The paper describes an empirical pipeline: hybrid teleoperation data collection (synthetic 100K trajectories + real 10K episodes), an offline discriminator that assigns clip-level weights, and training of a diffusion-transformer policy whose success rates (66.7% dexterous, 90% basic) are measured on held-out benchmarks. No equations, uniqueness theorems, or first-principles derivations are invoked whose outputs are definitionally equivalent to the fitted weights or the training corpus itself. The discriminator weighting is a modeling choice whose effectiveness is asserted via ablations, but the reported numbers are direct experimental outcomes rather than predictions forced by the paper's own definitions or self-citations. The derivation chain is therefore self-contained as standard supervised learning on collected data.
Axiom & Free-Parameter Ledger
free parameters (1)
- clip-level discriminator weights
axioms (1)
- domain assumption Hybrid exoskeleton-plus-markerless tracking produces demonstrations whose quality distribution can be meaningfully scored by an offline discriminator
Reference graph
Works this paper leans on
-
[1]
Rt-2: Vision-language-action models transfer web knowledge to robotic control,
B. Zitkovich, T. Yu, S. Xu, P. Xu, T. Xiao, F. Xia, J. Wu, P. Wohlhart, S. Welker, A. Wahid,et al., “Rt-2: Vision-language-action models transfer web knowledge to robotic control,” inCoRL, PMLR, 2023
work page 2023
-
[2]
OpenVLA: An Open-Source Vision-Language-Action Model
M. J. Kim, K. Pertsch, S. Karamcheti, T. Xiao, A. Balakrishna, S. Nair, R. Rafailov, E. Foster, G. Lam, P. Sanketi,et al., “Open- vla: An open-source vision-language-action model,”arXiv preprint arXiv:2406.09246, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[3]
$\pi_0$: A Vision-Language-Action Flow Model for General Robot Control
K. Black, N. Brown, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, L. Groom, K. Hausman, B. Ichter,et al., “π 0: A vision- language-action flow model for general robot control,”arXiv preprint arXiv:2410.24164, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[4]
$\pi_{0.5}$: a Vision-Language-Action Model with Open-World Generalization
P. Intelligence, K. Black, N. Brown, J. Darpinian, K. Dhabalia, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai,et al., “π 0.5: a vision-language-action model with open-world generalization,”arXiv preprint arXiv:2504.16054, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[5]
RDT-1B: a Diffusion Foundation Model for Bimanual Manipulation
S. Liu, L. Wu, B. Li, H. Tan, H. Chen, Z. Wang, K. Xu, H. Su, and J. Zhu, “Rdt-1b: a diffusion foundation model for bimanual manipulation,”arXiv preprint arXiv:2410.07864, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[6]
C. Cheang, S. Chen, Z. Cui, Y . Hu, L. Huang, T. Kong, H. Li, Y . Li, Y . Liu, X. Ma,et al., “Gr-3 technical report,”arXiv preprint arXiv:2507.15493, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[7]
GR00T N1: An Open Foundation Model for Generalist Humanoid Robots
J. Bjorck, F. Casta ˜neda, N. Cherniadev, X. Da, R. Ding, L. Fan, Y . Fang, D. Fox, F. Hu, S. Huang,et al., “Gr00t n1: An open foundation model for generalist humanoid robots,”arXiv preprint arXiv:2503.14734, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[8]
Learning fine-grained bimanual manipulation with low-cost hardware,
T. Z. Zhao, V . Kumar, S. Levine, and C. Finn, “Learning fine-grained bimanual manipulation with low-cost hardware,”RSS, 2023
work page 2023
-
[9]
Aloha 2: An enhanced low-cost hardware for bimanual teleoperation,
A. . Team, “Aloha 2: An enhanced low-cost hardware for bimanual teleoperation,”arXiv preprint arXiv:2405.02292, 2024
-
[10]
Open teach: A versatile teleoperation system for robotic manipulation,
A. Iyer, Z. Peng, Y . Dai, I. Guzey, S. Haldar, S. Chintala, and L. Pinto, “Open teach: A versatile teleoperation system for robotic manipulation,”CoRL, 2024
work page 2024
-
[11]
Bunny-visionpro: Real-time bimanual dexterous teleoperation for imitation learning,
R. Ding, Y . Qin, J. Zhu, C. Jia, S. Yang, R. Yang, X. Qi, and X. Wang, “Bunny-visionpro: Real-time bimanual dexterous teleoperation for imitation learning,”arXiv preprint arXiv:2407.03162, 2024
-
[12]
Anyteleop: A general vision-based dexterous robot arm- hand teleoperation system,
Y . Qin, W. Yang, B. Huang, K. Van Wyk, H. Su, X. Wang, Y .-W. Chao, and D. Fox, “Anyteleop: A general vision-based dexterous robot arm- hand teleoperation system,”RSS, 2023
work page 2023
-
[13]
Vision-based teleoperation of shadow dexterous hand using end-to-end deep neural network,
S. Li, X. Ma, H. Liang, M. G ¨orner, P. Ruppel, B. Fang, F. Sun, and J. Zhang, “Vision-based teleoperation of shadow dexterous hand using end-to-end deep neural network,”ICRA, 2019
work page 2019
-
[14]
Airexo: Low-cost exoskeletons for learning whole-arm manipulation in the wild,
H. Fang, H.-S. Fang, Y . Wang, J. Ren, J. Chen, R. Zhang, W. Wang, and C. Lu, “Airexo: Low-cost exoskeletons for learning whole-arm manipulation in the wild,”ICRA, 2024
work page 2024
-
[15]
Dexumi: Using human hand as the universal manipulation interface for dexterous manipulation,
M. Xu, H. Zhang, Y . Hou, Z. Xu, L. Fan, M. Veloso, and S. Song, “Dexumi: Using human hand as the universal manipulation interface for dexterous manipulation,”CoRL, 2025
work page 2025
-
[16]
Spark-remote: A cost-effective sys- tem for remote bimanual robot teleoperation,
A. Imdieke and K. Desingh, “Spark-remote: A cost-effective sys- tem for remote bimanual robot teleoperation,”arXiv preprint arXiv:2504.05488, 2025
-
[17]
Gello: A general, low- cost, and intuitive teleoperation framework for robot manipulators,
P. Wu, Y . Shentu, Z. Yi, X. Lin, and P. Abbeel, “Gello: A general, low- cost, and intuitive teleoperation framework for robot manipulators,” IROS, 2024
work page 2024
-
[18]
How to train your robots? the impact of demonstration modality on imitation learning,
H. Li, Y . Cui, and D. Sadigh, “How to train your robots? the impact of demonstration modality on imitation learning,”arXiv preprint arXiv:2503.07017, 2025
-
[19]
C. Pan, K. Junge, and J. Hughes, “Vision-language-action model and diffusion policy switching enables dexterous control of an anthropo- morphic hand,”arXiv preprint arXiv:2410.14022, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[20]
Dexgraspnet 2.0: Learning generative dexterous grasping in large-scale synthetic cluttered scenes,
J. Zhang, H. Liu, D. Li, X. Yu, H. Geng, Y . Ding, J. Chen, and H. Wang, “Dexgraspnet 2.0: Learning generative dexterous grasping in large-scale synthetic cluttered scenes,” in8th CoRL, 2024
work page 2024
-
[21]
Dex1b: Learning with 1b demonstrations for dexterous manipulation,
J. Ye, K. Wang, C. Yuan, R. Yang, Y . Li, J. Zhu, Y . Qin, X. Zou, and X. Wang, “Dex1b: Learning with 1b demonstrations for dexterous manipulation,”arXiv preprint arXiv:2506.17198, 2025
-
[22]
G-hop: Generative hand- object prior for interaction reconstruction and grasp synthesis,
Y . Ye, A. Gupta, K. Kitani, and S. Tulsiani, “G-hop: Generative hand- object prior for interaction reconstruction and grasp synthesis,” in CVPR, pp. 1911–1920, 2024
work page 1911
-
[23]
Dexgrasp anything: Towards universal robotic dexterous grasping with physics awareness,
Y . Zhong, Q. Jiang, J. Yu, and Y . Ma, “Dexgrasp anything: Towards universal robotic dexterous grasping with physics awareness,” in CVPR, pp. 22584–22594, 2025
work page 2025
-
[24]
Y . Xu, W. Wan, J. Zhang, H. Liu, Z. Shan, H. Shen, R. Wang, H. Geng, Y . Weng, J. Chen,et al., “Unidexgrasp: Universal robotic dexterous grasping via learning diverse proposal generation and goal-conditioned policy,” inCVPR, pp. 4737–4746, 2023
work page 2023
-
[25]
Semgrasp: Semantic grasp generation via language aligned discretization,
K. Li, J. Wang, L. Yang, C. Lu, and B. Dai, “Semgrasp: Semantic grasp generation via language aligned discretization,” inECCV, 2024
work page 2024
-
[26]
Realdex: Towards human-like grasping for robotic dexterous hand,
Y . Liu, Y . Yang, Y . Wang, X. Wu, J. Wang, Y . Yao, S. Schwertfeger, S. Yang, W. Wang, J. Yu,et al., “Realdex: Towards human-like grasping for robotic dexterous hand,”arXiv:2402.13853, 2024
-
[27]
Dexonomy: Synthesiz- ing all dexterous grasp types in a grasp taxonomy,
J. Chen, Y . Ke, L. Peng, and H. Wang, “Dexonomy: Synthesiz- ing all dexterous grasp types in a grasp taxonomy,”arXiv preprint arXiv:2504.18829, 2025
-
[28]
Robustdex- grasp: Robust dexterous grasping of general objects,
H. Zhang, Z. Wu, L. Huang, S. Christen, and J. Song, “Robustdex- grasp: Robust dexterous grasping of general objects,”arXiv preprint arXiv:2504.05287, 2025
-
[29]
Maniptrans: Efficient dexterous bimanual manipulation transfer via residual learning,
K. Li, P. Li, T. Liu, Y . Li, and S. Huang, “Maniptrans: Efficient dexterous bimanual manipulation transfer via residual learning,” in CVPR, pp. 6991–7003, 2025
work page 2025
-
[30]
Dexmimicgen: Automated data generation for biman- ual dexterous manipulation via imitation learning,
Z. Jiang, Y . Xie, K. Lin, Z. Xu, W. Wan, A. Mandlekar, L. Fan, and Y . Zhu, “Dexmimicgen: Automated data generation for biman- ual dexterous manipulation via imitation learning,”arXiv preprint arXiv:2410.24185, 2024
-
[31]
R. Yang, Q. Yu, Y . Wu, R. Yan, B. Li, A.-C. Cheng, X. Zou, Y . Fang, H. Yin, S. Liu,et al., “Egovla: Learning vision-language-action models from egocentric human videos,”arXiv:2507.12440, 2025
-
[32]
Ta-vla: Elucidating the design space of torque-aware vision- language-action models,
Z. Zhang, H. Xu, Z. Yang, C. Yue, Z. Lin, H.-a. Gao, Z. Wang, and H. Zhao, “Ta-vla: Elucidating the design space of torque-aware vision- language-action models,”arXiv preprint arXiv:2509.07962, 2025
-
[33]
Robochemist: Long-horizon and safety-compliant robotic chemical experimentation,
Z. Zhang, C. Yue, H. Xu, M. Liao, X. Qi, H.-a. Gao, Z. Wang, and H. Zhao, “Robochemist: Long-horizon and safety-compliant robotic chemical experimentation,”arXiv preprint arXiv:2509.08820, 2025
-
[34]
GraspVLA: a Grasping Foundation Model Pre-trained on Billion-scale Synthetic Action Data
S. Deng, M. Yan, S. Wei, H. Ma, Y . Yang, J. Chen, Z. Zhang, T. Yang, X. Zhang, H. Cui,et al., “Graspvla: a grasping foundation model pre-trained on billion-scale synthetic action data,”arXiv preprint arXiv:2505.03233, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[35]
Q. Bu, J. Cai, L. Chen, X. Cui, Y . Ding, S. Feng, S. Gao, X. He, X. Hu, X. Huang,et al., “Agibot world colosseo: A large-scale manipulation platform for scalable and intelligent embodied systems,”arXiv preprint arXiv:2503.06669, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[36]
Dexgraspvla: A vision-language-action framework towards general dexterous grasping,
Y . Zhong, X. Huang, R. Li, C. Zhang, Y . Liang, Y . Yang, and Y . Chen, “Dexgraspvla: A vision-language-action framework towards general dexterous grasping,”arXiv preprint arXiv:2502.20900, 2025
-
[37]
Being-h0: Vision-language-action pretraining from large-scale human videos,
H. Luo, Y . Feng, W. Zhang, S. Zheng, Y . Wang, H. Yuan, J. Liu, C. Xu, Q. Jin, and Z. Lu, “Being-h0: Vision-language-action pretraining from large-scale human videos,”arXiv preprint arXiv:2507.15597, 2025
-
[38]
Dreamgen: Unlocking generalization in robot learning through neural trajectories,
J. Jang, S. Ye, Z. Lin, J. Xiang, J. Bjorck, Y . Fang, F. Hu, S. Huang, K. Kundalia, Y .-C. Lin,et al., “Dreamgen: Unlocking generalization in robot learning through neural trajectories,” pp. arXiv–2505, 2025
work page 2025
-
[39]
S. Bai, K. Chen, X. Liu, J. Wang, W. Ge, S. Song, K. Dang, P. Wang, S. Wang, J. Tang,et al., “Qwen2. 5-vl technical report,”arXiv preprint arXiv:2502.13923, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[40]
Objaverse-xl: A universe of 10m+ 3d objects,
M. Deitke, R. Liu, M. Wallingford, H. Ngo, O. Michel, A. Kusupati, A. Fan, C. Laforte, V . V oleti, S. Y . Gadre,et al., “Objaverse-xl: A universe of 10m+ 3d objects,”NeurIPS, vol. 36, 2023
work page 2023
-
[41]
Exploring the limits of transfer learning with a unified text-to-text transformer,
C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y . Zhou, W. Li, and P. J. Liu, “Exploring the limits of transfer learning with a unified text-to-text transformer,”JMLR, vol. 21, pp. 1–67, 2020
work page 2020
-
[42]
Sigmoid loss for language image pre-training,
X. Zhai, B. Mustafa, A. Kolesnikov, and L. Beyer, “Sigmoid loss for language image pre-training,” inProceedings of the IEEE/CVF international conference on computer vision, pp. 11975–11986, 2023
work page 2023
-
[43]
Discriminator-weighted offline imitation learning from suboptimal demonstrations,
H. Xu, X. Zhan, H. Yin, and H. Qin, “Discriminator-weighted offline imitation learning from suboptimal demonstrations,” inInternational Conference on Machine Learning, pp. 24725–24742, PMLR, 2022
work page 2022
-
[44]
Diffusion policy: Visuomotor policy learning via action diffusion,
C. Chi, Z. Xu, S. Feng, E. Cousineau, Y . Du, B. Burchfiel, R. Tedrake, and S. Song, “Diffusion policy: Visuomotor policy learning via action diffusion,”IJRR, 2023
work page 2023
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.