arxiv: 2606.09314 · v1 · pith:MV4FVKO4new · submitted 2026-06-08 · 💻 cs.RO

KPGrasp: Scalable Keypoint Flow Matching for Dexterous Grasp Generation

Pith reviewed 2026-06-27 16:31 UTC · model grok-4.3

classification 💻 cs.RO
keywords dexterous grasp generation · flow matching · keypoint parameterization · transformer model · grasp success rate · simulation benchmarks · real-robot deployment

The pith

Keypoint flow matching with an all-Euclidean hand parameterization generates dexterous grasps at 76.3 percent success using only the standard loss.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents KPGrasp as a flow-matching approach that learns priors for dexterous grasping directly from large-scale data. It replaces mixed pose and joint-angle outputs with a uniform 3D keypoint representation of the hand that sits in the same coordinate frame as the object point cloud. A Transformer model is trained solely on the standard flow-matching objective, without any contact losses or post-processing refinement steps. On the Dexonomy benchmark this yields a 76.3 percent grasp success rate and 2.4 mm penetration depth, a 47.4 percent lift over the strongest comparable baseline, while the same weights also lead the DexGrasp Anything benchmark and run at 0.032 seconds per grasp in batch mode. Real-robot trials on twenty objects confirm that the learned distribution transfers without additional tuning.

Core claim

KPGrasp shows that dexterous grasp distributions can be modeled by transporting hand keypoints in Euclidean 3D space with a scalable Transformer flow model trained exclusively on the standard flow-matching loss; the resulting generator reaches 76.3 percent success on Dexonomy with 2.4 mm penetration, improves 47.4 percent over the best directly comparable baseline, leads the DexGrasp Anything benchmark without fine-tuning, and transfers to twenty real objects at 0.032 s per grasp.
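To make "transporting hand keypoints in Euclidean 3D space" concrete, the following is a minimal sketch of how samples are typically drawn from a flow model: Gaussian noise is pushed along a learned velocity field with fixed-step Euler integration. The names `sample_keypoints` and `make_toy_velocity` are hypothetical, the toy velocity field stands in for the trained Transformer, and the object point-cloud conditioning is omitted. The step count and the integrator are assumptions, since this page does not state which solver the paper uses.

```python
import numpy as np


def sample_keypoints(velocity_fn, num_keypoints, num_steps=20, rng=None):
    """Integrate dx/dtau = v(x, tau) from tau = 0 (noise) to tau = 1 (grasp).

    velocity_fn is any callable (x, tau) -> array shaped like x. In a real
    system it would be the trained network; here it is a placeholder.
    """
    rng = np.random.default_rng() if rng is None else rng
    x = rng.standard_normal((num_keypoints, 3))  # one 3D point per keypoint
    dt = 1.0 / num_steps
    for k in range(num_steps):
        tau = k * dt  # tau never reaches 1.0 inside the loop
        x = x + dt * velocity_fn(x, tau)
    return x


def make_toy_velocity(target):
    """Hypothetical stand-in for the network.

    For the straight path x_tau = (1 - tau) * x0 + tau * x1 with a single
    known endpoint x1 = target, the exact velocity at (x, tau) is
    (target - x) / (1 - tau).
    """
    def velocity_fn(x, tau):
        return (target - x) / (1.0 - tau)
    return velocity_fn
```

With the toy field, every Euler step stays on the straight line between the noise sample and the target, so the sampler lands on the target keypoints after the final step. A learned field would instead produce a different grasp for each noise draw.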

What carries the argument

All-Euclidean 3D hand-keypoint parameterization that places grasp configurations in the same frame as the object point cloud, allowing a Transformer flow model to perform native spatial reasoning under the plain flow-matching objective.
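As a rough illustration of what "the plain flow-matching objective" looks like on keypoints, here is a minimal NumPy sketch of the common conditional flow-matching loss with a straight-line path between noise and data. The function name `flow_matching_loss` and the `velocity_fn` signature are hypothetical, the object point-cloud conditioning is left out, and the linear path and uniform time sampling are assumptions rather than details confirmed by this page.

```python
import numpy as np


def flow_matching_loss(velocity_fn, x1, rng):
    """Mean squared error between predicted and target velocities.

    x1: array of shape (B, J, 3), a batch of ground-truth hand keypoints
        expressed in the same frame as the object point cloud.
    velocity_fn: callable (x_tau, tau) -> predicted velocity, shape (B, J, 3).
    """
    x0 = rng.standard_normal(x1.shape)            # noise endpoint
    tau = rng.uniform(size=(x1.shape[0], 1, 1))   # one flow time per sample
    x_tau = (1.0 - tau) * x0 + tau * x1           # point on the straight path
    target_v = x1 - x0                            # constant velocity of that path
    pred_v = velocity_fn(x_tau, tau)
    return float(np.mean((pred_v - target_v) ** 2))
```

Because every quantity is an ordinary 3D coordinate, the regression target needs no rotation-specific handling and no per-term loss weights, which is the simplification the parameterization is credited with.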

If this is right

  • Reaches 76.3 percent grasp success rate on the Dexonomy benchmark while cutting penetration depth to 2.4 mm.
  • Improves 47.4 percent over the strongest directly comparable baseline that also avoids contact losses.
  • Leads the DexGrasp Anything benchmark in average performance without any fine-tuning.
  • Runs at 0.032 seconds per grasp under batched inference.
  • Transfers directly to real-robot execution on twenty diverse objects.
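The headline numbers above leave one thing open: this page does not say whether the 47.4 percent gain is in absolute percentage points or relative to the baseline. The short calculation below, a sketch with the hypothetical helper `implied_baseline`, shows the baseline success rate implied by each reading, and converts the reported batched latency into throughput. Only 76.3, 47.4 and 0.032 come from the page; the two baseline values are derived, not reported.

```python
def implied_baseline(success, gain, relative):
    """Baseline success rate (percent) implied by a reported gain."""
    if relative:
        # success = baseline * (1 + gain / 100)
        return success / (1.0 + gain / 100.0)
    # success = baseline + gain (percentage points)
    return success - gain


success = 76.3            # reported Dexonomy success rate, percent
gain = 47.4               # reported improvement, reading left open
seconds_per_grasp = 0.032  # reported batched inference time

baseline_if_absolute = implied_baseline(success, gain, relative=False)  # about 28.9
baseline_if_relative = implied_baseline(success, gain, relative=True)   # about 51.8
grasps_per_second = 1.0 / seconds_per_grasp                             # about 31.25
```

The two readings differ by more than 20 points of baseline success, so the distinction matters when comparing against other methods.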

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Keypoint representations may reduce the need for hand-crafted contact terms across other contact-rich manipulation tasks.
  • The observed scaling with data volume and model size suggests further gains are available from larger training corpora.
  • Removing contact-loss engineering could shorten the iteration cycle when adapting the method to new robot hands.
  • The same flow-matching pipeline might be tested on in-hand reorientation or tool-use sequences without redesigning the loss.

Load-bearing premise

The all-Euclidean 3D hand-keypoint parameterization is expressive enough to represent valid, collision-free dexterous grasps that generalize from large-scale training data to both simulation benchmarks and real-world objects.

What would settle it

A new test set of objects where the generated keypoint trajectories match the learned distribution yet produce frequent joint-limit violations or undetected collisions in full physics simulation would falsify the claim.
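One inexpensive way to probe this before running full physics is to check whether generated keypoints respect the hand's fixed bone lengths. The sketch below assumes a hypothetical skeleton given as index pairs and reference lengths taken from the hand model; neither the function name `bone_length_errors` nor the topology comes from the paper.

```python
import numpy as np


def bone_length_errors(keypoints, edges, ref_lengths):
    """Absolute deviation of each bone from its reference length.

    keypoints:   (J, 3) generated keypoint positions, in meters.
    edges:       list of (parent, child) index pairs (hypothetical skeleton).
    ref_lengths: reference length of each edge from the hand model, in meters.
    """
    kp = np.asarray(keypoints, dtype=float)
    parent, child = np.asarray(edges).T
    lengths = np.linalg.norm(kp[parent] - kp[child], axis=-1)
    return np.abs(lengths - np.asarray(ref_lengths, dtype=float))
```

Large errors on samples that otherwise look plausible would indicate that the generator leaves the reachable manifold, in which case the inverse-kinematics step has to absorb the discrepancy.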

Figures

Figures reproduced from arXiv: 2606.09314 by Bing Han, Haoran Liu, He Wang, Jiangran Lyu, Jiayi Chen, Li Yi, Mi Yan, Yuansen Huang, Yubin Ke.

Figure 1: Overview. Unlike prior methods that use pose-joint parameterizations and rely on auxiliary contact losses or contact-based test-time refinement, KPGrasp parametrizes grasps with Euclidean hand keypoints and learns a simple yet scalable Transformer-based flow-matching model. KPGrasp achieves state-of-the-art performance on two simulation benchmarks and generates high-quality, diverse dexterous grasps.
Figure 2: Motivation for Keypoint Output Parameterization. Traditional pose-joint outputs mix non-Euclidean wrist pose with joint angles, leading to SO(3) discontinuities [23], kinematic error accumulation, and loss balancing issues. Our keypoint outputs place the hand in Euclidean space, align naturally with object point clouds, and map cleanly to Transformer tokens.
Figure 3: KPGrasp Model. We use a conditional DiT-based network to learn the keypoint flow. The model is trained only with the standard flow-matching objective. During inference, the predicted keypoints are converted into joint angles by inverse kinematics for execution. The right panel shows the adaptive layer normalization that injects the flow time τ into each attention or FFN sublayer.
Figure 4: Qualitative Comparison on Dexonomy. KPGrasp produces stable and diverse grasps on complex objects, while baselines often have severe hand-object penetration.
Figure 5: Scaling Behavior. KPGrasp improves as training data, model size, and batch size increase. Keypoint parameterization consistently outperforms the conventional pose-joint variant. Note that this protocol differs from Dexonomy: success is evaluated in IsaacGym [48], and penetration is computed from point clouds and point normals.
Figure 6: Real World Setup. (Left) 20 test objects. (Right) Robot setup. Finally, we evaluate KPGrasp on partial point clouds in the real world.
Figure 7: Sharpa Hand Keypoint Annotation. We annotate J = 21 keypoints on Sharpa Hand for the real-world experiment.
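The Figure 3 caption mentions adaptive layer normalization that injects the flow time τ into each sublayer. A minimal sketch of the idea follows, assuming the usual form in which a conditioning vector (here an embedding of τ) is linearly mapped to a per-channel scale and shift applied after a parameter-free layer norm. The function names and weight shapes are hypothetical, and any gating term the actual model may use is not shown.

```python
import numpy as np


def layer_norm(x, eps=1e-6):
    """Normalize each token over its feature dimension, without learned affine."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def adaptive_layer_norm(x, cond, w_scale, w_shift):
    """Modulate normalized tokens with a scale and shift predicted from cond.

    x:       (tokens, d) token features, e.g. one token per hand keypoint.
    cond:    (c,) conditioning vector, assumed to be an embedding of tau.
    w_scale: (c, d) hypothetical projection producing the per-channel scale.
    w_shift: (c, d) hypothetical projection producing the per-channel shift.
    """
    scale = cond @ w_scale   # shape (d,)
    shift = cond @ w_shift   # shape (d,)
    return layer_norm(x) * (1.0 + scale) + shift
```

With zero projection weights the block reduces to a plain layer norm, which is why such projections are often initialized at zero so that conditioning is learned gradually.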
Original abstract

Generating high-quality dexterous grasps remains challenging for learning-based methods, which often depend on carefully tuned contact losses or costly contact-based test-time refinement. We present KPGrasp, a flow-matching framework that learns dexterous grasp priors from large-scale data rather than relying on contact losses or contact-based test-time refinement. KPGrasp couples an all-Euclidean 3D hand-keypoint parameterization with a simple yet scalable Transformer flow model. The parameterization avoids the drawbacks of the conventional mixed SE(3) pose and joint-angle output space, expresses grasps in the same frame as the object point cloud, and thus enables native spatial reasoning; the Transformer flow model is trained with only the standard flow-matching loss and scales effectively with data, model capacity, and batch size. Experiments demonstrate state-of-the-art performance on two simulation benchmarks. On the Dexonomy benchmark, it reaches a 76.3% grasp success rate, improving over the strongest directly comparable baseline by 47.4% while reducing penetration depth to 2.4 mm. The same model also achieves the best average performance on the DexGrasp Anything benchmark without fine-tuning. For batched inference, KPGrasp requires only 0.032 s per grasp. Finally, real-world experiments on 20 diverse objects demonstrate that the pipeline can be deployed in a real-world setup.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces KPGrasp, a flow-matching framework for dexterous grasp generation that couples an all-Euclidean 3D hand-keypoint parameterization with a scalable Transformer model. The approach is trained using only the standard flow-matching loss on large-scale data, without contact losses or contact-based test-time refinement. It reports state-of-the-art results on the Dexonomy benchmark (76.3% grasp success rate, 47.4% improvement over the strongest comparable baseline, 2.4 mm penetration depth) and best average performance on the DexGrasp Anything benchmark without fine-tuning, along with 0.032 s per-grasp batched inference and successful real-world deployment on 20 diverse objects.

Significance. If the experimental claims hold under detailed scrutiny, the work shows that a simple, scalable data-driven flow model on a Euclidean keypoint representation can deliver strong dexterous grasping performance without specialized contact terms. This would be a meaningful simplification for the field and supports the value of scaling model capacity, batch size, and training data in grasp generation.

major comments (2)
  1. [§3] §3 (Method, keypoint parameterization subsection): The central claim that the standard flow-matching loss alone suffices for 76.3% success and 2.4 mm penetration rests on the all-Euclidean 3D keypoint parameterization being expressive enough to represent only valid, collision-free grasps. The text states that this parameterization 'avoids the drawbacks of the conventional mixed SE(3) pose and joint-angle output space' and 'enables native spatial reasoning,' but provides no explicit mechanism (fixed bone lengths, joint limits, or intra-hand collision avoidance) to keep generated keypoints on the reachable manifold. Without such constraints or a described decoding/projection step, it is unclear whether performance derives from the flow model or from training-data statistics; this must be clarified with concrete evidence from the model architecture or post-processing.
  2. [§5] §5 (Experiments, Dexonomy benchmark results): The reported 76.3% success rate and 47.4% relative improvement are load-bearing for the 'no contact losses or refinement' thesis. The abstract supplies no protocol details, baseline definitions, statistical significance, or ablation evidence; the full experimental section must include these (e.g., exact success metric definition, number of trials, variance across seeds, and direct comparison to the 'strongest directly comparable baseline') to substantiate that the gains are not artifacts of unstated post-processing.
minor comments (2)
  1. The abstract states 'the same model also achieves the best average performance on the DexGrasp Anything benchmark without fine-tuning' but does not define 'average performance' (success rate, penetration, or a composite metric). Clarify the metric and report per-object or per-category breakdowns for transparency.
  2. Real-world experiments are mentioned on 20 objects but lack quantitative metrics (success rate, penetration) or failure-mode analysis comparable to the simulation benchmarks. Adding these would strengthen the deployment claim.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and will revise the manuscript accordingly to improve clarity on the keypoint parameterization and to expand experimental protocol details.

  1. Referee: [§3] §3 (Method, keypoint parameterization subsection): The central claim that the standard flow-matching loss alone suffices for 76.3% success and 2.4 mm penetration rests on the all-Euclidean 3D keypoint parameterization being expressive enough to represent only valid, collision-free grasps. The text states that this parameterization 'avoids the drawbacks of the conventional mixed SE(3) pose and joint-angle output space' and 'enables native spatial reasoning,' but provides no explicit mechanism (fixed bone lengths, joint limits, or intra-hand collision avoidance) to keep generated keypoints on the reachable manifold. Without such constraints or a described decoding/projection step, it is unclear whether performance derives from the flow model or from training-data statistics; this must be clarified with concrete evidence from the model architecture or post-processing.

    Authors: The all-Euclidean 3D keypoint parameterization contains no explicit mechanisms such as fixed bone lengths, joint limits, or intra-hand collision avoidance, either in the architecture or as post-processing. The model is trained solely with the standard flow-matching loss on large-scale valid grasp data; the Transformer learns the distribution of feasible keypoint configurations directly from that data, which empirically yields low penetration without contact terms. We will revise §3 to state this explicitly and add a short discussion of why hard constraints were omitted in favor of a purely data-driven approach. revision: yes

  2. Referee: [§5] §5 (Experiments, Dexonomy benchmark results): The reported 76.3% success rate and 47.4% relative improvement are load-bearing for the 'no contact losses or refinement' thesis. The abstract supplies no protocol details, baseline definitions, statistical significance, or ablation evidence; the full experimental section must include these (e.g., exact success metric definition, number of trials, variance across seeds, and direct comparison to the 'strongest directly comparable baseline') to substantiate that the gains are not artifacts of unstated post-processing.

    Authors: We agree that the experimental section requires additional protocol details to support the claims. In the revision we will specify the exact success metric (object lift without drop within a fixed time), the number of evaluation trials per object, reported variance across random seeds, and the precise definition of the strongest directly comparable baseline. No contact-based post-processing or refinement was applied at test time; the reported numbers reflect direct sampling from the flow model. We will also include the requested ablation evidence in §5. revision: yes
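The variance reporting promised in this response could be summarized along the following lines. This is a sketch only: `success_rate_summary` is a hypothetical helper, each inner sequence is assumed to hold binary lift outcomes (1 for success, 0 for failure) from one random seed, and no numbers from the paper are used.

```python
import numpy as np


def success_rate_summary(outcomes_per_seed):
    """Mean and sample standard deviation of success rate across seeds.

    outcomes_per_seed: sequence of sequences of 0/1 trial outcomes,
                       one inner sequence per random seed.
    Returns (mean_percent, std_percent).
    """
    rates = np.array([np.mean(o) for o in outcomes_per_seed], dtype=float) * 100.0
    mean = float(rates.mean())
    # Sample standard deviation needs at least two seeds.
    std = float(rates.std(ddof=1)) if len(rates) > 1 else 0.0
    return mean, std
```

Reporting the spread next to the mean would let readers judge whether a gap between two methods exceeds seed-to-seed noise.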

Circularity Check

0 steps flagged

No circularity: empirical results from standard flow-matching on external benchmarks

The paper describes a data-driven Transformer flow model trained solely with the standard flow-matching loss on large-scale data, using an all-Euclidean keypoint parameterization. Reported metrics (76.3% success, 2.4 mm penetration) are measured on independent simulation benchmarks (Dexonomy, DexGrasp Anything) and real-world objects; they do not reduce to quantities defined inside the paper by construction, fitted parameters renamed as predictions, or self-citation chains. The method explicitly avoids contact losses and test-time refinement, making performance an external validation rather than a tautology. No load-bearing step matches any enumerated circularity pattern.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

Only the abstract is available, so the ledger is necessarily incomplete; the central claim rests on the assumption that large-scale grasp data plus a standard flow objective suffice when paired with the keypoint representation.

free parameters (1)
  • Transformer capacity, batch size, and training data scale
    The abstract states that the model scales effectively with data, model capacity, and batch size, implying these are chosen hyperparameters.
axioms (1)
  • Domain assumption: Large-scale grasp datasets contain representative examples of valid dexterous grasps that can be learned via flow matching without explicit contact modeling
    The framework learns priors from large-scale data rather than relying on contact losses.

pith-pipeline@v0.9.1-grok · 5799 in / 1337 out tokens · 33515 ms · 2026-06-27T16:31:31.739363+00:00 · methodology

Reference graph

Works this paper leans on

53 extracted references · 24 canonical work pages · 6 internal anchors

  1. [1]

    A. T. Miller and P. K. Allen. Graspit! a versatile simulator for robotic grasping. IEEE Robotics & Automation Magazine, 11(4):110–122, 2004

  2. [2]

    M. Ciocarlie, C. Goldfeder, and P. Allen. Dexterous grasping via eigengrasps: A low-dimensional approach to a high-complexity problem. In Robotics: Science and systems manipulation workshop-sensing and adapting to the real world, 2007

  3. [3]

    T. Liu, Z. Liu, Z. Jiao, Y. Zhu, and S.-C. Zhu. Synthesizing diverse and physically stable grasps with arbitrary hand structures using differentiable force closure estimator. IEEE Robotics and Automation Letters, 7(1):470–477, 2021

  4. [4]

    A. H. Li, P. Culbertson, J. W. Burdick, and A. D. Ames. Frogger: Fast robust grasp generation via the min-weight metric. In 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 6809–6816. IEEE, 2023

  5. [5]

    S. Chen, J. Bohg, and C. K. Liu. Springgrasp: An optimization pipeline for robust and compliant dexterous pre-grasp synthesis. arXiv preprint arXiv:2404.13532, 2024

  6. [6]

    J. Chen, Y. Chen, J. Zhang, and H. Wang. Task-oriented dexterous hand pose synthesis using differentiable grasp wrench boundary estimator. In 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 5281–5288. IEEE, 2024

  7. [7]

    J. Chen, Y. Ke, and H. Wang. Bodex: Scalable and efficient robotic dexterous grasp synthesis using bilevel optimization. In 2025 IEEE International Conference on Robotics and Automation (ICRA), pages 01–08. IEEE, 2025

  8. [8]

    J. Chen, Y. Ke, L. Peng, and H. Wang. Dexonomy: Synthesizing all dexterous grasp types in a grasp taxonomy. arXiv preprint arXiv:2504.18829, 2025

  9. [9]

    C. Ferrari, J. F. Canny, et al. Planning optimal grasps. In ICRA, volume 3, page 6, 1992

  10. [10]

    R. Wang, J. Zhang, J. Chen, Y. Xu, P. Li, T. Liu, and H. Wang. Dexgraspnet: A large-scale robotic dexterous grasp dataset for general objects based on simulation. arXiv preprint arXiv:2210.02697, 2022

  11. [11]

    J. Zhang, H. Liu, D. Li, X. Yu, H. Geng, Y. Ding, J. Chen, and H. Wang. Dexgraspnet 2.0: Learning generative dexterous grasping in large-scale synthetic cluttered scenes. In 8th Annual Conference on Robot Learning, 2024

  12. [12]

    J. Ye, K. Wang, C. Yuan, R. Yang, Y. Li, J. Zhu, Y. Qin, X. Zou, and X. Wang. Dex1b: Learning with 1b demonstrations for dexterous manipulation. arXiv preprint arXiv:2506.17198, 2025

  13. [13]

    J. He, D. Li, X. Yu, Z. Qi, W. Zhang, J. Chen, Z. Zhang, Z. Zhang, L. Yi, and H. Wang. Dexvlg: Dexterous vision-language-grasp model at scale. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 14248–14258, 2025

  14. [14]

    T. Feix, J. Romero, H.-B. Schmiedmayer, A. M. Dollar, and D. Kragic. The grasp taxonomy of human grasp types. IEEE Transactions on human-machine systems, 46(1):66–77, 2015

  15. [15]

    H. Jiang, S. Liu, J. Wang, and X. Wang. Hand-object contact consistency reasoning for human grasps generation. In Proceedings of the IEEE/CVF international conference on computer vision, pages 11107–11116, 2021

  16. [16]

    T. Zhu, R. Wu, X. Lin, and Y. Sun. Toward human-like grasp: Dexterous grasping via semantic representation of object-hand. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15741–15751, 2021

  17. [17]

    G.-H. Xu, Y.-L. Wei, D. Zheng, X.-M. Wu, and W.-S. Zheng. Dexterous grasp transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17933–17942, 2024

  18. [18]

    Y. Zhong, Q. Jiang, J. Yu, and Y. Ma. Dexgrasp anything: Towards universal robotic dexterous grasping with physics awareness. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 22584–22594, 2025

  19. [19]

    Y. Xu, W. Wan, J. Zhang, H. Liu, Z. Shan, H. Shen, R. Wang, H. Geng, Y. Weng, J. Chen, et al. Unidexgrasp: Universal robotic dexterous grasping via learning diverse proposal generation and goal-conditioned policy. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4737–4746, 2023

  20. [20]

    P. Li, T. Liu, Y. Li, Y. Geng, Y. Zhu, Y. Yang, and S. Huang. Gendexgrasp: Generalizable dexterous grasping. arXiv preprint arXiv:2210.00722, 2022

  21. [21]

    J. Lu, H. Kang, H. Li, B. Liu, Y. Yang, Q. Huang, and G. Hua. Ugg: Unified generative grasping. In European Conference on Computer Vision, pages 414–433. Springer, 2024

  22. [22]

    Z. Weng, H. Lu, D. Kragic, and J. Lundell. Dexdiffuser: Generating dexterous grasps with diffusion models. IEEE Robotics and Automation Letters, 9(12):11834–11840, 2024

  23. [23]

    Y. Zhou, C. Barnes, J. Lu, J. Yang, and H. Li. On the continuity of rotation representations in neural networks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5745–5753, 2019

  24. [24]

    M. Liu, Z. Pan, K. Xu, K. Ganguly, and D. Manocha. Deep differentiable grasp planner for high-dof grippers. arXiv preprint arXiv:2002.01530, 2020

  25. [25]

    Z. Q. Chen, K. Van Wyk, Y.-W. Chao, W. Yang, A. Mousavian, A. Gupta, and D. Fox. Learning robust real-world dexterous grasping policies via implicit shape augmentation. arXiv preprint arXiv:2210.13638, 2022

  26. [26]

    Q. Feng, J. Feng, Z. Chen, R. Triebel, and A. Knoll. Ffhflow: Diverse and uncertainty-aware dexterous grasp generation via flow variational inference. arXiv preprint arXiv:2407.15161, 2024

  27. [27]

    Z. Wei, Z. Xu, J. Guo, Y. Hou, C. Gao, Z. Cai, J. Luo, and L. Shao. D(R,O) grasp: A unified representation of robot and object interaction for cross-embodiment dexterous grasping. arXiv preprint arXiv:2410.01702, 2024

  28. [28]

    Y. Shi, Z. Guo, R. Wolf, E. Welte, and R. Rayyes. Hograspflow: Exploring vision-based generative grasp synthesis with hand-object priors and taxonomy awareness. arXiv preprint arXiv:2509.16871, 2025

  29. [29]

    B. Lim, J. Kim, J. Kim, Y. Lee, and F. C. Park. Equigraspflow: SE(3)-equivariant 6-dof grasp pose generative flows. In 8th Annual Conference on Robot Learning, 2024

  30. [30]

    X. Zhu, F. Wang, R. Walters, and J. Shi. SE(3)-equivariant diffusion policy in spherical fourier space. arXiv preprint arXiv:2507.01723, 2025

  31. [31]

    W. Cheng, J. H. Park, and J. H. Ko. Handfoldingnet: A 3d hand pose estimation network using multiscale-feature guided folding of a 2d hand skeleton. In Proceedings of the IEEE/CVF international conference on computer vision, pages 11260–11269, 2021

  32. [32]

    J. Chen, M. Yan, J. Zhang, Y. Xu, X. Li, Y. Weng, L. Yi, S. Song, and H. Wang. Tracking and reconstructing hand object interactions from point cloud sequences in the wild. In Proceedings of the AAAI conference on artificial intelligence, volume 37, pages 304–312, 2023

  33. [33]

    W. Cheng, H. Tang, L. Van Gool, and J. H. Ko. Handdiff: 3d hand pose estimation with diffusion on image-point cloud. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2274–2284, 2024

  34. [34]

    S. Haldar and L. Pinto. Point policy: Unifying observations and actions with key points for robot manipulation. arXiv preprint arXiv:2502.20391, 2025

  35. [35]

    R. G. Goswami, A. Bar, D. Fan, T.-Y. Yang, G. Zhou, P. Krishnamurthy, M. Rabbat, F. Khorrami, and Y. LeCun. World models can leverage human videos for dexterous manipulation. arXiv preprint arXiv:2512.13644, 2025

  36. [36]

    I. Guzey, H. Qi, J. Urain, C. Wang, J. Yin, K. Bodduluri, M. Lambeta, L. Pinto, A. Rai, J. Malik, et al. Dexterity from smart lenses: Multi-fingered robot manipulation with in-the-wild human demonstrations. arXiv preprint arXiv:2511.16661, 2025

  37. [37]

    Y. Lipman, R. T. Chen, H. Ben-Hamu, M. Nickel, and M. Le. Flow matching for generative modeling. arXiv preprint arXiv:2210.02747, 2022

  38. [38]

    W. Grathwohl, R. T. Chen, J. Bettencourt, I. Sutskever, and D. Duvenaud. Ffjord: Free-form continuous dynamics for scalable reversible generative models. arXiv preprint arXiv:1810.01367, 2018

  39. [39]

    Y. Song, J. Sohl-Dickstein, D. P. Kingma, A. Kumar, S. Ermon, and B. Poole. Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456, 2020

  40. [40]

    S. Umeyama. Least-squares estimation of transformation parameters between two point patterns. IEEE Transactions on Pattern Analysis and Machine Intelligence, 13(4):376–380, 1991

  41. [41]

    K. Zakka. Mink: Python inverse kinematics based on MuJoCo, Feb. 2026. URL https://github.com/kevinzakka/mink

  42. [42]

    C. Choy, J. Gwak, and S. Savarese. 4d spatio-temporal convnets: Minkowski convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3075–3084, 2019

  43. [43]

    C. R. Qi, L. Yi, H. Su, and L. J. Guibas. Pointnet++: Deep hierarchical feature learning on point sets in a metric space. In Advances in neural information processing systems, volume 30, 2017

  44. [44]

    W. Peebles and S. Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF international conference on computer vision, pages 4195–4205, 2023

  45. [45]

    J. Liu, J. Su, X. Yao, Z. Jiang, G. Lai, Y. Du, Y. Qin, W. Xu, E. Lu, J. Yan, Y. Chen, H. Zheng, Y. Liu, S. Liu, B. Yin, W. He, H. Zhu, Y. Wang, J. Wang, M. Dong, Z. Zhang, Y. Kang, H. Zhang, X. Xu, Y. Zhang, Y. Wu, X. Zhou, and Z. Yang. Muon is scalable for llm training. arXiv preprint arXiv:2502.16982, 2025

  46. [46]

    Y. Liu, Y. Yang, Y. Wang, X. Wu, J. Wang, Y. Yao, S. Schwertfeger, S. Yang, W. Wang, J. Yu, et al. Realdex: Towards human-like grasping for robotic dexterous hand. arXiv preprint arXiv:2402.13853, 2024

  47. [47]

    S. Huang, Z. Wang, P. Li, B. Jia, T. Liu, Y. Zhu, W. Liang, and S.-C. Zhu. Diffusion-based generation, optimization, and planning in 3d scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16750–16761, 2023

  48. [48]

    V. Makoviychuk, L. Wawrzyniak, Y. Guo, M. Lu, K. Storey, M. Macklin, D. Hoeller, N. Rudin, A. Allshire, A. Handa, et al. Isaac gym: High performance gpu-based physics simulation for robot learning. arXiv preprint arXiv:2108.10470, 2021

  49. [49]

    J. Levinson, C. Esteves, K. Chen, N. Snavely, A. Kanazawa, A. Rostamizadeh, and A. Makadia. An analysis of svd for deep rotation estimation. Advances in Neural Information Processing Systems, 33:22554–22565, 2020

  50. [50]

    A. J. Bose, T. Akhound-Sadegh, G. Huguet, K. Fatras, J. Rector-Brooks, C.-H. Liu, A. C. Nica, M. Korablyov, M. Bronstein, and A. Tong. SE(3)-stochastic flow matching for protein backbone generation. arXiv preprint arXiv:2310.02391, 2023

  51. [51]

    C. Eugenio, H. Nick, A. Max, W. Tim, B. Thomas, and V. Abhinav. Learning robotic manipulation policies from point clouds with conditional flow matching. arXiv preprint arXiv:2409.07343, 2024

  52. [52]

    T. Ren, S. Liu, A. Zeng, J. Lin, K. Li, H. Cao, J. Chen, X. Huang, Y. Chen, F. Yan, Z. Zeng, H. Zhang, F. Li, J. Yang, H. Li, Q. Jiang, and L. Zhang. Grounded sam: Assembling open-world models for diverse visual tasks. arXiv preprint arXiv:2401.14159, 2024

  53. [53]

    N. M. Boffi, M. S. Albergo, and E. Vanden-Eijnden. How to build a consistency model: Learning flow maps via self-distillation. arXiv preprint arXiv:2505.18825, 2025