arxiv: 2606.27036 · v1 · pith:DSGUINZCnew · submitted 2026-06-25 · 💻 cs.RO

RelAfford6D: Relational 6D Affordance Graphs for Constraint-Driven Robotic Manipulation

Pith reviewed 2026-06-26 05:11 UTC · model grok-4.3

classification 💻 cs.RO
keywords robotic manipulation · affordance graphs · kinematic constraints · zero-shot generalization · SE(3) poses · articulated objects · vision foundation models · constraint satisfaction

The pith

RelAfford6D turns free-form instructions into semantic topology and then into kinematic constraints solved by tracking physical manifolds.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a training-free system that builds a relational graph from a language instruction to identify which object part interacts with which anchor. Vision models lift the graph nodes to exact 6D poses, allowing the robot to treat manipulation as satisfying explicit revolute or prismatic constraints rather than learning from data. If the conversion from language to metric constraints holds, the approach produces trajectories that generalize across object categories and recover from disturbances through closed-loop replanning. The authors report higher zero-shot success than data-driven baselines in both simulation and real-robot tests. A reader would care because the method offers an explicit physical bridge between open-ended instructions and precise control of articulated mechanisms.

Core claim

Given a free-form instruction, RelAfford6D deduces a semantic topology linking a primary interacting part to its physical anchor. These topological nodes are elevated to precise metric SE(3) poses via vision foundation models, after which downstream execution is formulated analytically as a kinematic constraint satisfaction problem. The robot then synthesizes continuous trajectories by tracking strictly defined physical manifolds such as revolute or prismatic orbits, augmented by closed-loop tracking for dynamic replanning.
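
The last step of the claim, tracking a revolute or prismatic manifold, can be illustrated with a minimal sketch. It assumes the hinge line (a point and a direction) or the sliding direction has already been recovered from the grounded poses. The function names `revolute_waypoints` and `prismatic_waypoints` are hypothetical and are not taken from the paper, and only waypoint positions are generated, not full SE(3) end-effector poses.

```python
import math
from typing import List, Tuple

Vec3 = Tuple[float, float, float]


def _unit(v: Vec3) -> Vec3:
    """Normalise a direction vector; a zero vector cannot define an axis."""
    n = math.sqrt(sum(c * c for c in v))
    if n == 0.0:
        raise ValueError("axis direction must be non-zero")
    return (v[0] / n, v[1] / n, v[2] / n)


def _cross(a: Vec3, b: Vec3) -> Vec3:
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def revolute_waypoints(grasp: Vec3, axis_point: Vec3, axis_dir: Vec3,
                       angle: float, steps: int) -> List[Vec3]:
    """Rotate `grasp` about the hinge line in `steps` equal increments (hypothetical helper)."""
    k = _unit(axis_dir)
    v = tuple(g - p for g, p in zip(grasp, axis_point))  # lever arm from the hinge
    kxv = _cross(k, v)
    kdv = sum(ki * vi for ki, vi in zip(k, v))
    out = []
    for i in range(1, steps + 1):
        t = angle * i / steps
        c, s = math.cos(t), math.sin(t)
        # Rodrigues' rotation formula applied to the lever arm
        out.append(tuple(
            axis_point[j] + v[j] * c + kxv[j] * s + k[j] * kdv * (1.0 - c)
            for j in range(3)))
    return out


def prismatic_waypoints(grasp: Vec3, axis_dir: Vec3,
                        distance: float, steps: int) -> List[Vec3]:
    """Slide `grasp` along the joint direction in `steps` equal increments (hypothetical helper)."""
    k = _unit(axis_dir)
    return [tuple(grasp[j] + k[j] * distance * i / steps for j in range(3))
            for i in range(1, steps + 1)]
```

For example, `revolute_waypoints((1.0, 0.0, 0.0), (0.0, 0.0, 0.0), (0.0, 0.0, 1.0), math.pi / 2, 2)` returns two points on the unit circle around the z-axis, the second of which lies at approximately (0, 1, 0).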

What carries the argument

Relational 6D Affordance Graph that links semantic nodes to metric SE(3) poses so that manipulation becomes a kinematic constraint satisfaction problem on physical manifolds.
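
One way to picture such a graph is as a small typed data structure. The sketch below is an assumption about how it might be represented, not the authors' implementation: the class and field names (`Pose`, `AffordanceNode`, `AffordanceEdge`, `JointType`) are invented for illustration, and poses are stored as a position plus a unit quaternion.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Tuple


class JointType(Enum):
    """Constraint families named in the paper's abstract."""
    REVOLUTE = "revolute"
    PRISMATIC = "prismatic"


@dataclass(frozen=True)
class Pose:
    """A metric SE(3) pose: translation plus rotation (hypothetical layout)."""
    position: Tuple[float, float, float]
    quaternion: Tuple[float, float, float, float]  # (w, x, y, z), assumed unit norm


@dataclass(frozen=True)
class AffordanceNode:
    """A semantic node (e.g. a handle or hinge) lifted to a metric pose."""
    label: str
    pose: Pose


@dataclass(frozen=True)
class AffordanceEdge:
    """Relation between the primary interacting part and its physical anchor."""
    part: AffordanceNode
    anchor: AffordanceNode
    joint: JointType

    def lever_arm(self) -> Tuple[float, float, float]:
        """Offset of the part from the anchor, expressed in the shared reference frame."""
        p, a = self.part.pose.position, self.anchor.pose.position
        return (p[0] - a[0], p[1] - a[1], p[2] - a[2])
```

Keeping the joint type on the edge rather than on either node reflects the idea that the constraint belongs to the relation between part and anchor, which is a design choice of this sketch.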

If this is right

  • Higher zero-shot success rates than data-driven baselines in both simulation and real-world settings.
  • Cross-category generalization without retraining or task-specific data collection.
  • Execution robustness through closed-loop replanning against external disturbances.
  • Training-free operation that formulates manipulation directly from language-derived constraints.
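
The closed-loop replanning point can be sketched for a single joint value, such as a door angle or drawer extension. This is a simplified illustration under stated assumptions: `track_joint`, `observe`, and `command` are hypothetical names, the joint state is assumed to be directly observable, and the paper's actual tracker operates on poses rather than on a scalar.

```python
from typing import Callable, List


def track_joint(target: float,
                observe: Callable[[], float],
                command: Callable[[float], None],
                max_step: float = 0.1,
                tol: float = 1e-3,
                max_iters: int = 100) -> List[float]:
    """Drive a 1-DoF joint value to `target`, replanning from every new observation."""
    history: List[float] = []
    for _ in range(max_iters):
        q = observe()                 # re-estimate the joint state each cycle
        history.append(q)
        err = target - q
        if abs(err) <= tol:
            break                     # constraint satisfied within tolerance
        step = max(-max_step, min(max_step, err))
        # The next setpoint starts from the observed state, not from a stale plan,
        # so an external disturbance is absorbed on the following cycle.
        command(q + step)
    return history
```

If the joint is pushed back between two cycles, the next observation simply reports a smaller value and the loop issues further steps until the target is reached or `max_iters` is exhausted.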

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The method could lower dependence on large-scale manipulation datasets by replacing learned policies with explicit constraint solving.
  • It opens the possibility of hybrid systems in which this graph-based layer handles precision while learned components manage perception noise.
  • Extending the same graph construction to multi-object or sequential tasks would require chaining multiple affordance graphs.
  • Real-world deployment would benefit from explicit uncertainty estimates on the vision-derived poses to trigger fallback behaviors.

Load-bearing premise

Vision foundation models can reliably extract the correct semantic topology from instructions and produce sufficiently accurate metric poses to define valid kinematic constraints.

What would settle it

A test set of instructions on novel articulated objects where the extracted graph nodes yield poses that generate colliding or kinematically invalid trajectories.
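
The kinematic half of such a test could be checked mechanically. The sketch below measures how far a candidate trajectory strays from the circle implied by a revolute axis; it is an illustrative check with a hypothetical function name (`revolute_residual`), it assumes the ground-truth axis is known, and it does not cover collision checking.

```python
import math
from typing import Iterable, Optional, Tuple

Vec3 = Tuple[float, float, float]


def revolute_residual(waypoints: Iterable[Vec3],
                      axis_point: Vec3, axis_dir: Vec3) -> float:
    """Largest deviation of any waypoint from the circle defined by the first waypoint."""
    n = math.sqrt(sum(c * c for c in axis_dir))
    if n == 0.0:
        raise ValueError("axis direction must be non-zero")
    k = tuple(c / n for c in axis_dir)
    ref: Optional[Tuple[float, float]] = None
    worst = 0.0
    for w in waypoints:
        v = tuple(w[j] - axis_point[j] for j in range(3))
        height = sum(k[j] * v[j] for j in range(3))           # offset along the axis
        radial = tuple(v[j] - height * k[j] for j in range(3))
        radius = math.sqrt(sum(c * c for c in radial))        # distance to the axis
        if ref is None:
            ref = (radius, height)                            # first waypoint sets the circle
        worst = max(worst, abs(radius - ref[0]), abs(height - ref[1]))
    return worst
```

A trajectory whose residual exceeds a chosen tolerance would count as kinematically invalid for that joint; the tolerance itself is a free choice that the paper does not specify.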

Figures

Figures reproduced from arXiv: 2606.27036 by Bayram Bayramli, Guodong Zhang, Hongtao Lu, Qichen He, Qiuchang Li, Shaokai Wu, Wenyuan Xie, Yanbiao Ji, Yue Ding.

Figure 1: RelAfford6D bridges observations and rigorous physical control for open-world manipulation. From a text prompt, it deduces a Relational 6D Affordance Graph that links a primary interacting part to its physical anchor. By elevating the topological nodes into metric SE(3) poses, the system formulates execution as a continuous kinematic constraint process, enabling robust, training-free closed-loop manipulat…
Figure 2: Overview of the RelAfford6D framework. Given a free-form instruction, (a) Semantic Topology Generation deduces a relational graph that explicitly links a primary interacting part to its physical anchor. Then, (b) Metric Visual Grounding elevates these topological nodes into an instantiated spatial graph $\Psi = \{(T^*_P, T^*_A)\}$ via pixel-aligned SE(3) pose estimation. Finally, (c) Constraint-Driven Kinematic Ex…
Figure 3: Semantic Topology Generation example dataset with part-level semantics and mobility attributes, to construct a structured kinematic knowledge base. By integrating retrieval-augmented generation (RAG) with LLM reasoning, we perform zero-shot Semantic Topology Generation, effectively deducing the relational node topology of unseen objects without heuristic parsing. Structured Kinematic Knowledge Base. Firs…
Figure 4: Visualization of the Metric Visual Grounding process.
Figure 5: Visualization of closed-loop action execution.
Figure 6: Real-world experiments on Realman RM75. tracking remains mathematically valid. This decoupled design provides a far more robust execution anchor than global visual embeddings. 4.4 Real-world Evaluation To validate the effectiveness and robustness of our training-free framework in unconstrained physical environments, we deploy RelAfford6D on a Realman RM75 robotic arm equipped with a wrist-mounted Intel Re…
Abstract

Bridging abstract semantics and precise physical control remains a fundamental challenge in open-world robotic manipulation. While recent data-driven policies show promise, their reliance on isolated contact points or latent affordance embeddings lacks the rigorous kinematic constraints necessary for complex articulated objects. To overcome this limitation, we introduce RelAfford6D, a novel training-free framework centered on a Relational 6D Affordance Graph. Given a free-form instruction, our system deduces a semantic topology linking a primary interacting part to its physical anchor. By elevating these topological nodes into precise metric $SE(3)$ poses via vision foundation models, we analytically formulate downstream execution as a kinematic constraint satisfaction problem. The robot synthesizes continuous trajectories by tracking strictly defined physical manifolds (e.g., revolute or prismatic orbits). Coupled with a closed-loop tracking mechanism for dynamic replanning against disturbances, our physically grounded approach achieves superior zero-shot success rates, cross-category generalization, and execution robustness in both simulation and real-world environments, outperforming existing data-driven baselines.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The manuscript introduces RelAfford6D, a training-free framework that constructs a Relational 6D Affordance Graph from free-form language instructions. It deduces semantic topology between interacting parts and anchors, elevates nodes to metric SE(3) poses using vision foundation models, analytically formulates execution as a kinematic constraint satisfaction problem over revolute or prismatic manifolds, and employs closed-loop tracking for replanning. The central claim is that this yields superior zero-shot success rates, cross-category generalization, and robustness compared with data-driven baselines in both simulation and real-world settings.

Significance. If the performance claims hold with rigorous validation, the work would provide a concrete analytical bridge between semantic language understanding and precise kinematic control for articulated objects, reducing reliance on task-specific training data and potentially improving generalization in open-world manipulation.

major comments (2)
  1. [Abstract] The assertion of 'superior zero-shot success rates ... outperforming existing data-driven baselines' is presented without any quantitative metrics, baselines, error bars, or experimental protocol, rendering the central empirical claim unevaluable from the manuscript.
  2. [Abstract] The weakest assumption (VFM-derived SE(3) poses suffice to define valid kinematic manifolds) is load-bearing for the entire pipeline yet unsupported by any reported pose-error statistics, ablation on metric accuracy versus semantic topology, or failure cases when VFM output deviates from ground-truth articulation axes.
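
The pose-error statistics requested in the second comment are usually built from a few standard quantities. The sketch below shows one plausible set: translation error, geodesic rotation error, and the angle between an estimated and a ground-truth articulation axis. The function names are hypothetical, rotations are assumed to be given as 3x3 row-major matrices, and treating an axis as sign-free is a choice of this sketch rather than something the paper states.

```python
import math
from typing import Sequence

Mat3 = Sequence[Sequence[float]]


def translation_error(t_est: Sequence[float], t_gt: Sequence[float]) -> float:
    """Euclidean distance between estimated and ground-truth positions."""
    return math.dist(t_est, t_gt)  # math.dist requires Python 3.8+


def rotation_error_deg(R_est: Mat3, R_gt: Mat3) -> float:
    """Geodesic angle between two rotation matrices, in degrees."""
    # trace(R_est^T R_gt) equals the sum of element-wise products
    tr = sum(R_est[i][j] * R_gt[i][j] for i in range(3) for j in range(3))
    c = max(-1.0, min(1.0, (tr - 1.0) / 2.0))  # clamp against rounding error
    return math.degrees(math.acos(c))


def axis_error_deg(a_est: Sequence[float], a_gt: Sequence[float]) -> float:
    """Angle between two articulation axes, ignoring their sign (an assumption)."""
    dot = abs(sum(x * y for x, y in zip(a_est, a_gt)))
    norm = math.sqrt(sum(x * x for x in a_est)) * math.sqrt(sum(y * y for y in a_gt))
    if norm == 0.0:
        raise ValueError("axes must be non-zero")
    return math.degrees(math.acos(min(1.0, dot / norm)))
```

Reporting these per object category, alongside task success, would separate failures caused by metric inaccuracy from failures caused by a wrong semantic topology.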

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and indicate planned revisions to strengthen the manuscript.

  1. Referee: [Abstract] The assertion of 'superior zero-shot success rates ... outperforming existing data-driven baselines' is presented without any quantitative metrics, baselines, error bars, or experimental protocol, rendering the central empirical claim unevaluable from the manuscript.

    Authors: We agree that the abstract should include key quantitative support for the central claims to allow evaluation without requiring the full text. The detailed results (success rates, baselines, error bars, and protocols) appear in Sections 4 and 5, but the abstract currently summarizes them only qualitatively. We will revise the abstract to incorporate representative metrics from the experiments. revision: yes

  2. Referee: [Abstract] The weakest assumption (VFM-derived SE(3) poses suffice to define valid kinematic manifolds) is load-bearing for the entire pipeline yet unsupported by any reported pose-error statistics, ablation on metric accuracy versus semantic topology, or failure cases when VFM output deviates from ground-truth articulation axes.

    Authors: The referee is correct that explicit validation of the VFM pose assumption is needed. The current manuscript reports overall task success and qualitative examples but does not include dedicated pose-error statistics, ablations separating metric accuracy from topology, or systematic failure-case analysis for VFM deviations. We will add these analyses (pose errors vs. ground truth, relevant ablations, and failure modes) in a new subsection or appendix. revision: yes

Circularity Check

0 steps flagged

No circularity; analytical training-free method relies on external VFMs and standard kinematics.

The provided abstract and description present RelAfford6D as a training-free framework that deduces semantic topology from free-form instructions, elevates nodes to SE(3) poses via vision foundation models, and analytically formulates kinematic constraints (revolute/prismatic orbits) for trajectory synthesis. No equations, fitted parameters, or self-referential definitions appear. The derivation chain invokes external vision models and classical kinematic constraint satisfaction rather than redefining inputs as outputs or importing uniqueness via self-citation. The central performance claims are positioned as empirical outcomes of this pipeline, not tautological by construction. This is the expected non-finding for an analytical method without visible self-referential reductions.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The framework rests on the unverified reliability of vision foundation models for precise pose estimation and on the assumption that free-form instructions yield unambiguous semantic topologies; no free parameters or invented physical entities are explicitly introduced in the abstract.

axioms (2)
  • domain assumption: Vision foundation models supply accurate metric SE(3) poses from single or few images for the identified affordance nodes.
    Central to elevating topological nodes into executable constraints (abstract).
  • domain assumption: Semantic topology deduction from free-form language instructions is sufficiently reliable to define the primary interacting part and its physical anchor.
    Required before pose lifting and constraint formulation (abstract).
invented entities (1)
  • Relational 6D Affordance Graph (no independent evidence)
    purpose: Links the primary interacting part to its physical anchor for downstream kinematic solving.
    Core novel structure introduced to bridge semantics and constraints.

pith-pipeline@v0.9.1-grok · 5738 in / 1350 out tokens · 41384 ms · 2026-06-26T05:11:05.389172+00:00 · methodology
