RelAfford6D: Relational 6D Affordance Graphs for Constraint-Driven Robotic Manipulation

Bayram Bayramli; Guodong Zhang; Hongtao Lu; Qichen He; Qiuchang Li; Shaokai Wu; Wenyuan Xie; Yanbiao Ji; Yue Ding

arxiv: 2606.27036 · v1 · pith:DSGUINZCnew · submitted 2026-06-25 · 💻 cs.RO


Guodong Zhang, Qichen He, Wenyuan Xie, Shaokai Wu, Yanbiao Ji, Qiuchang Li, Bayram Bayramli, Yue Ding, Hongtao Lu

Pith reviewed 2026-06-26 05:11 UTC · model grok-4.3

classification 💻 cs.RO

keywords robotic manipulation · affordance graphs · kinematic constraints · zero-shot generalization · SE(3) poses · articulated objects · vision foundation models · constraint satisfaction


The pith

RelAfford6D turns free-form instructions into semantic topology and then into kinematic constraints solved by tracking physical manifolds.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a training-free system that builds a relational graph from a language instruction to identify which object part interacts with which anchor. Vision models lift the graph nodes to exact 6D poses, allowing the robot to treat manipulation as satisfying explicit revolute or prismatic constraints rather than learning from data. If the conversion from language to metric constraints holds, the approach produces trajectories that generalize across object categories and recover from disturbances through closed-loop replanning. The authors report higher zero-shot success than data-driven baselines in both simulation and real-robot tests. A reader would care because the method offers an explicit physical bridge between open-ended instructions and precise control of articulated mechanisms.

Core claim

Given a free-form instruction, RelAfford6D deduces a semantic topology linking a primary interacting part to its physical anchor. These topological nodes are elevated to precise metric SE(3) poses via vision foundation models, after which downstream execution is formulated analytically as a kinematic constraint satisfaction problem. The robot then synthesizes continuous trajectories by tracking strictly defined physical manifolds such as revolute or prismatic orbits, augmented by closed-loop tracking for dynamic replanning.
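To make "tracking a revolute or prismatic manifold" concrete, here is a minimal sketch, not the authors' implementation. It assumes the joint axis (a point on the axis plus a direction) has already been recovered from the anchor pose, and it only generates positional waypoints for the grasp point. The function names `rotate_about_axis`, `revolute_waypoints`, and `prismatic_waypoints` are hypothetical.

```python
import math

def rotate_about_axis(point, axis_point, axis_dir, angle):
    """Rotate `point` by `angle` radians about the line through `axis_point`
    with direction `axis_dir`, using Rodrigues' rotation formula."""
    n = math.sqrt(sum(c * c for c in axis_dir))
    k = [c / n for c in axis_dir]                      # unit axis direction
    v = [p - a for p, a in zip(point, axis_point)]     # point relative to axis
    k_cross_v = [k[1] * v[2] - k[2] * v[1],
                 k[2] * v[0] - k[0] * v[2],
                 k[0] * v[1] - k[1] * v[0]]
    k_dot_v = sum(ki * vi for ki, vi in zip(k, v))
    c, s = math.cos(angle), math.sin(angle)
    rotated = [v[i] * c + k_cross_v[i] * s + k[i] * k_dot_v * (1 - c)
               for i in range(3)]
    return [r + a for r, a in zip(rotated, axis_point)]

def revolute_waypoints(grasp, axis_point, axis_dir, total_angle, steps):
    """Waypoints on the circular arc the grasp point follows around a hinge."""
    return [rotate_about_axis(grasp, axis_point, axis_dir, total_angle * i / steps)
            for i in range(steps + 1)]

def prismatic_waypoints(grasp, axis_dir, distance, steps):
    """Waypoints on the straight line the grasp point follows along a slider."""
    n = math.sqrt(sum(c * c for c in axis_dir))
    k = [c / n for c in axis_dir]
    return [[g + ki * distance * i / steps for g, ki in zip(grasp, k)]
            for i in range(steps + 1)]

# Example: a handle 1 m from a vertical hinge, opened by 90 degrees.
arc = revolute_waypoints([1.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 1.0],
                         math.pi / 2, 4)
```

A full controller would also rotate the gripper orientation along the arc; this sketch leaves that out to keep the constraint itself visible.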

What carries the argument

Relational 6D Affordance Graph that links semantic nodes to metric SE(3) poses so that manipulation becomes a kinematic constraint satisfaction problem on physical manifolds.
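One way to picture such a graph is as a single edge that stores the two named nodes, their SE(3) poses, and the joint type. The sketch below is illustrative only: the class names `Pose` and `AffordanceEdge`, the field names, and the rotation-plus-translation representation are assumptions, not the paper's data structure. The helper computes the part pose expressed in the anchor frame, i.e. $T_A^{-1} T_P$.

```python
from dataclasses import dataclass
from typing import List

Matrix = List[List[float]]
Vector = List[float]

@dataclass
class Pose:
    """Rigid transform: 3x3 rotation R and translation t (hypothetical layout)."""
    R: Matrix
    t: Vector

@dataclass
class AffordanceEdge:
    """One part-to-anchor relation with metric poses (hypothetical structure)."""
    part: str
    anchor: str
    joint_type: str      # "revolute" or "prismatic"
    T_part: Pose
    T_anchor: Pose

    def part_in_anchor_frame(self) -> Pose:
        """Return T_anchor^{-1} * T_part as (R_A^T R_P, R_A^T (t_P - t_A))."""
        Ra_T = [list(row) for row in zip(*self.T_anchor.R)]   # transpose of R_A
        R_rel = [[sum(Ra_T[i][k] * self.T_part.R[k][j] for k in range(3))
                  for j in range(3)] for i in range(3)]
        d = [p - a for p, a in zip(self.T_part.t, self.T_anchor.t)]
        t_rel = [sum(Ra_T[i][k] * d[k] for k in range(3)) for i in range(3)]
        return Pose(R_rel, t_rel)

# Example: a hinge frame rotated 90 degrees about z, with the handle 1 m away.
edge = AffordanceEdge(
    part="handle", anchor="hinge", joint_type="revolute",
    T_part=Pose([[1, 0, 0], [0, 1, 0], [0, 0, 1]], [1.0, 1.0, 0.0]),
    T_anchor=Pose([[0, -1, 0], [1, 0, 0], [0, 0, 1]], [1.0, 0.0, 0.0]),
)
relative = edge.part_in_anchor_frame()
```

Storing the relation relative to the anchor is what lets a constraint stay meaningful if the whole object moves, since only the anchor pose needs re-estimating.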

If this is right

Higher zero-shot success rates than data-driven baselines in both simulation and real-world settings.
Cross-category generalization without retraining or task-specific data collection.
Execution robustness through closed-loop replanning against external disturbances.
Training-free operation that formulates manipulation directly from language-derived constraints.
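The closed-loop replanning point can be illustrated with a toy planar loop. This is a sketch under strong assumptions, not the paper's tracker: the hinge and grasp positions are reduced to 2D, `observe` and `move_to` are hypothetical callbacks standing in for pose re-estimation and robot motion, and the joint angle is read directly from geometry. Each tick re-reads the scene, so a shifted object changes the next waypoint instead of invalidating a precomputed path.

```python
import math

def joint_angle(grasp, hinge):
    """Planar joint angle of the grasp point around the hinge."""
    return math.atan2(grasp[1] - hinge[1], grasp[0] - hinge[0])

def track_revolute_goal(observe, move_to, target_angle,
                        step=0.1, tol=0.01, max_iters=200):
    """Re-observe the hinge and grasp every tick and command the next point
    on the arc. Returns the number of ticks used, or None if not converged."""
    for i in range(max_iters):
        hinge, grasp = observe()                     # fresh pose estimate
        angle = joint_angle(grasp, hinge)
        error = target_angle - angle
        if abs(error) < tol:
            return i
        delta = max(-step, min(step, error))         # bounded angular step
        radius = math.hypot(grasp[0] - hinge[0], grasp[1] - hinge[1])
        move_to((hinge[0] + radius * math.cos(angle + delta),
                 hinge[1] + radius * math.sin(angle + delta)))
    return None

# Toy world: the object is pushed 0.2 m sideways on the fifth observation.
world = {"hinge": (0.0, 0.0), "grasp": (1.0, 0.0), "ticks": 0}

def observe():
    world["ticks"] += 1
    if world["ticks"] == 5:
        world["hinge"] = (world["hinge"][0] + 0.2, world["hinge"][1])
        world["grasp"] = (world["grasp"][0] + 0.2, world["grasp"][1])
    return world["hinge"], world["grasp"]

def move_to(point):
    world["grasp"] = point

ticks = track_revolute_goal(observe, move_to, math.pi / 2)
```

Because the next waypoint is always computed from the latest hinge estimate, the push in the toy world does not need special handling.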

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The method could lower dependence on large-scale manipulation datasets by replacing learned policies with explicit constraint solving.
It opens the possibility of hybrid systems in which this graph-based layer handles precision while learned components manage perception noise.
Extending the same graph construction to multi-object or sequential tasks would require chaining multiple affordance graphs.
Real-world deployment would benefit from explicit uncertainty estimates on the vision-derived poses to trigger fallback behaviors.

Load-bearing premise

Vision foundation models can reliably extract the correct semantic topology from instructions and produce sufficiently accurate metric poses to define valid kinematic constraints.

What would settle it

A test set of instructions on novel articulated objects where the extracted graph nodes yield poses that generate colliding or kinematically invalid trajectories.
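A check for "kinematically invalid" could be as simple as measuring how far a trajectory drifts from the circle implied by an estimated hinge axis. The sketch below is one possible diagnostic, not a procedure from the paper; `radius_and_height` and `revolute_violation` are hypothetical names, and any pass/fail threshold would have to be chosen per task.

```python
import math

def _unit(v):
    n = math.sqrt(sum(c * c for c in v))
    return [c / n for c in v]

def radius_and_height(p, axis_point, axis_dir):
    """Distance of `p` from the axis line, and its offset along the axis."""
    k = _unit(axis_dir)
    v = [pi - ai for pi, ai in zip(p, axis_point)]
    height = sum(vi * ki for vi, ki in zip(v, k))
    radial = [vi - height * ki for vi, ki in zip(v, k)]
    return math.sqrt(sum(c * c for c in radial)), height

def revolute_violation(waypoints, axis_point, axis_dir):
    """Largest change in radius or axial offset relative to the first waypoint.
    A true revolute motion about the given axis keeps both constant."""
    r0, h0 = radius_and_height(waypoints[0], axis_point, axis_dir)
    worst = 0.0
    for p in waypoints[1:]:
        r, h = radius_and_height(p, axis_point, axis_dir)
        worst = max(worst, abs(r - r0), abs(h - h0))
    return worst

# An arc about the true z axis, scored against the true and a shifted axis.
arc = [[1.0, 0.0, 0.5],
       [math.cos(0.5), math.sin(0.5), 0.5],
       [0.0, 1.0, 0.5]]
good = revolute_violation(arc, [0.0, 0.0, 0.0], [0.0, 0.0, 1.0])
bad = revolute_violation(arc, [0.1, 0.0, 0.0], [0.0, 0.0, 1.0])
```

In this example a 10 cm error in the hinge location already produces a radial mismatch of roughly 10 cm over a quarter turn, which is the kind of deviation the proposed test set would expose.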

Figures

Figures reproduced from arXiv: 2606.27036 by Bayram Bayramli, Guodong Zhang, Hongtao Lu, Qichen He, Qiuchang Li, Shaokai Wu, Wenyuan Xie, Yanbiao Ji, Yue Ding.

**Figure 1.** RelAfford6D bridges observations and rigorous physical control for open-world manipulation. From a text prompt, it deduces a Relational 6D Affordance Graph that links a primary interacting part to its physical anchor. By elevating the topological nodes into metric SE(3) poses, the system formulates execution as a continuous kinematic constraint process, enabling robust, training-free closed-loop manipulat…

**Figure 2.** Overview of the RelAfford6D framework. Given a free-form instruction, (a) Semantic Topology Generation deduces a relational graph that explicitly links a primary interacting part to its physical anchor. Then, (b) Metric Visual Grounding elevates these topological nodes into an instantiated spatial graph $\Psi = \{(T_P^*, T_A^*)\}$ via pixel-aligned SE(3) pose estimation. Finally, (c) Constraint-Driven Kinematic Ex…

**Figure 3.** Semantic Topology Generation example dataset with part-level semantics and mobility attributes, to construct a structured kinematic knowledge base. By integrating retrieval-augmented generation (RAG) with LLM reasoning, we perform zero-shot Semantic Topology Generation, effectively deducing the relational node topology of unseen objects without heuristic parsing. Structured Kinematic Knowledge Base. Firs…

**Figure 4.** Visualization of the Metric Visual Grounding process.

**Figure 5.** Visualization of closed-loop action execution.

**Figure 6.** Real-world experiments on Realman RM75.

… tracking remains mathematically valid. This decoupled design provides a far more robust execution anchor than global visual embeddings.

4.4 Real-world Evaluation

To validate the effectiveness and robustness of our training-free framework in unconstrained physical environments, we deploy RelAfford6D on a Realman RM75 robotic arm equipped with a wrist-mounted Intel Re…

Abstract

Bridging abstract semantics and precise physical control remains a fundamental challenge in open-world robotic manipulation. While recent data-driven policies show promise, their reliance on isolated contact points or latent affordance embeddings lacks the rigorous kinematic constraints necessary for complex articulated objects. To overcome this limitation, we introduce RelAfford6D, a novel training-free framework centered on a Relational 6D Affordance Graph. Given a free-form instruction, our system deduces a semantic topology linking a primary interacting part to its physical anchor. By elevating these topological nodes into precise metric $SE(3)$ poses via vision foundation models, we analytically formulate downstream execution as a kinematic constraint satisfaction problem. The robot synthesizes continuous trajectories by tracking strictly defined physical manifolds (e.g., revolute or prismatic orbits). Coupled with a closed-loop tracking mechanism for dynamic replanning against disturbances, our physically grounded approach achieves superior zero-shot success rates, cross-category generalization, and execution robustness in both simulation and real-world environments, outperforming existing data-driven baselines.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

RelAfford6D sketches a training-free pipeline that turns language instructions into relational 6D graphs and then tracks analytic kinematic manifolds, but the abstract supplies no numbers or ablations to show the approach actually works better than baselines.


The main point is that this paper describes a training-free method for articulated manipulation that builds a relational graph linking a primary part to its anchor, lifts the nodes to SE(3) poses with vision foundation models, and then generates trajectories by tracking revolute or prismatic manifolds while adding closed-loop replanning. The abstract positions this as more generalizable than learned policies.

What is new is the explicit step of elevating semantic topology into metric 6D constraints that are solved analytically rather than through learned contact points or latent embeddings. The closed-loop mechanism for handling disturbances is a straightforward but useful addition that keeps the method grounded in physical manifolds. This framing does address a real gap between high-level instructions and precise control on objects with joints.

The soft spot is the complete lack of any quantitative evidence. The abstract claims superior zero-shot success rates, cross-category generalization, and robustness in simulation and on real robots, yet gives no success percentages, no baseline comparisons, no test objects, and no error statistics on the SE(3) poses produced by the vision models. The assumption that those models will output poses accurate enough to define valid kinematic constraints is stated but not measured or ablated. If the full paper contains those results and they hold, the idea gains traction; without them the performance edge cannot be attributed to the kinematic formulation.

This is aimed at roboticists working on manipulation of articulated objects who want structured alternatives to end-to-end learning. A reader looking for concrete ways to combine language, vision, and analytic constraints could extract useful formulation details even if the results still need checking.

I would bring the full version to a reading group only after seeing the experiments. It is not ready to cite. It deserves peer review if the manuscript includes solid quantitative results and ablations that address the pose-accuracy issue; otherwise the missing evidence makes evaluation difficult.

Referee Report

2 major / 0 minor

Summary. The manuscript introduces RelAfford6D, a training-free framework that constructs a Relational 6D Affordance Graph from free-form language instructions. It deduces semantic topology between interacting parts and anchors, elevates nodes to metric SE(3) poses using vision foundation models, analytically formulates execution as a kinematic constraint satisfaction problem over revolute or prismatic manifolds, and employs closed-loop tracking for replanning. The central claim is that this yields superior zero-shot success rates, cross-category generalization, and robustness compared with data-driven baselines in both simulation and real-world settings.

Significance. If the performance claims hold with rigorous validation, the work would provide a concrete analytical bridge between semantic language understanding and precise kinematic control for articulated objects, reducing reliance on task-specific training data and potentially improving generalization in open-world manipulation.

major comments (2)

[Abstract] The assertion of 'superior zero-shot success rates ... outperforming existing data-driven baselines' is presented without any quantitative metrics, baselines, error bars, success rates, or experimental protocol, rendering the central empirical claim unevaluable from the manuscript.
[Abstract] The weakest assumption (VFM-derived SE(3) poses suffice to define valid kinematic manifolds) is load-bearing for the entire pipeline yet unsupported by any reported pose-error statistics, ablation on metric accuracy versus semantic topology, or failure cases when VFM output deviates from ground-truth articulation axes.
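The pose-error statistics requested here are usually reported as a rotation angle and a translation distance between the estimated and ground-truth poses, plus an angular error on the articulation axis. The sketch below uses these standard definitions; it is not taken from the paper, the function names are hypothetical, and treating an axis and its negation as equivalent is an assumption that may not suit every joint convention.

```python
import math

def rotation_error_deg(R_est, R_gt):
    """Geodesic angle between two 3x3 rotations:
    arccos((trace(R_est^T R_gt) - 1) / 2), in degrees."""
    # trace(A^T B) equals the sum of elementwise products of A and B.
    tr = sum(R_est[i][j] * R_gt[i][j] for i in range(3) for j in range(3))
    cos_theta = max(-1.0, min(1.0, (tr - 1.0) / 2.0))   # clamp rounding noise
    return math.degrees(math.acos(cos_theta))

def translation_error(t_est, t_gt):
    """Euclidean distance between estimated and true positions."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(t_est, t_gt)))

def axis_error_deg(a_est, a_gt):
    """Angle between two axis directions, ignoring sign (assumption:
    an axis and its negation describe the same hinge line)."""
    dot = sum(x * y for x, y in zip(a_est, a_gt))
    na = math.sqrt(sum(x * x for x in a_est))
    nb = math.sqrt(sum(y * y for y in a_gt))
    cos_theta = max(-1.0, min(1.0, abs(dot) / (na * nb)))
    return math.degrees(math.acos(cos_theta))

# Example: an estimate that is off by 10 degrees about z and 5 cm in position.
a = math.radians(10.0)
R_est = [[math.cos(a), -math.sin(a), 0.0],
         [math.sin(a),  math.cos(a), 0.0],
         [0.0, 0.0, 1.0]]
R_gt = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
rot_err = rotation_error_deg(R_est, R_gt)
pos_err = translation_error([0.03, 0.04, 0.0], [0.0, 0.0, 0.0])
```

Reporting these per object category, alongside task success, would let a reader see how much pose error the kinematic formulation tolerates before trajectories fail.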

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and indicate planned revisions to strengthen the manuscript.


Referee: [Abstract] The assertion of 'superior zero-shot success rates ... outperforming existing data-driven baselines' is presented without any quantitative metrics, baselines, error bars, success rates, or experimental protocol, rendering the central empirical claim unevaluable from the manuscript.

Authors: We agree that the abstract should include key quantitative support for the central claims to allow evaluation without requiring the full text. The detailed results (success rates, baselines, error bars, and protocols) appear in Sections 4 and 5, but the abstract currently summarizes them only qualitatively. We will revise the abstract to incorporate representative metrics from the experiments.

revision: yes

Referee: [Abstract] The weakest assumption (VFM-derived SE(3) poses suffice to define valid kinematic manifolds) is load-bearing for the entire pipeline yet unsupported by any reported pose-error statistics, ablation on metric accuracy versus semantic topology, or failure cases when VFM output deviates from ground-truth articulation axes.

Authors: The referee is correct that explicit validation of the VFM pose assumption is needed. The current manuscript reports overall task success and qualitative examples but does not include dedicated pose-error statistics, ablations separating metric accuracy from topology, or systematic failure-case analysis for VFM deviations. We will add these analyses (pose errors vs. ground truth, relevant ablations, and failure modes) in a new subsection or appendix.

revision: yes

Circularity Check

0 steps flagged

No circularity; analytical training-free method relies on external VFMs and standard kinematics.


The provided abstract and description present RelAfford6D as a training-free framework that deduces semantic topology from free-form instructions, elevates nodes to SE(3) poses via vision foundation models, and analytically formulates kinematic constraints (revolute/prismatic orbits) for trajectory synthesis. No equations, fitted parameters, or self-referential definitions appear. The derivation chain invokes external vision models and classical kinematic constraint satisfaction rather than redefining inputs as outputs or importing uniqueness via self-citation. The central performance claims are positioned as empirical outcomes of this pipeline, not tautological by construction. This is the expected non-finding for an analytical method without visible self-referential reductions.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The framework rests on the unverified reliability of vision foundation models for precise pose estimation and on the assumption that free-form instructions yield unambiguous semantic topologies; no free parameters or invented physical entities are explicitly introduced in the abstract.

axioms (2)

domain assumption: Vision foundation models supply accurate metric SE(3) poses from single or few images for the identified affordance nodes.
Central to elevating topological nodes into executable constraints (abstract).
domain assumption: Semantic topology deduction from free-form language instructions is sufficiently reliable to define the primary interacting part and its physical anchor.
Required before pose lifting and constraint formulation (abstract).

invented entities (1)

Relational 6D Affordance Graph (no independent evidence)
purpose: Links the primary interacting part to its physical anchor for downstream kinematic solving.
Core novel structure introduced to bridge semantics and constraints.

pith-pipeline@v0.9.1-grok · 5738 in / 1350 out tokens · 41384 ms · 2026-06-26T05:11:05.389172+00:00 · methodology



[1] Brohan, A., Brown, N., Carbajal, J., Chebotar, Y., Dabis, J., Finn, C., Gopalakrishnan, K., Hausman, K., Herzog, A., Hsu, J., Ibarz, J., Ichter, B., Irpan, A., Jackson, T., Jesmonth, S., Joshi, N.J., Julian, R., Kalashnikov, D., Kuang, Y., Leal, I., Lee, K.H., Levine, S., Lu, Y., Malla, U., Manjunath, D., Mordatch, I., Nachum, O., Parada, C., Peralta, J...

[2] Carion, N., Gustafson, L., Hu, Y.T., Debnath, S., Hu, R., Suris, D., Ryali, C., Alwala, K.V., Khedr, H., Huang, A., Lei, J., Ma, T., Guo, B., Kalla, A., Marks, M., Greer, J., Wang, M., Sun, P., Rädle, R., Afouras, T., Mavroudi, E., Xu, K., Wu, T.H., Zhou, Y., Momeni, L., Hazra, R., Ding, S., Vaze, S., Porcher, F., Li, F., Li, S., Kamath, A., Cheng, H.K., ...: SAM 3: Segment Anything with Concepts (2025)

[3] Chang, A.X., Funkhouser, T., Guibas, L., Hanrahan, P., Huang, Q., Li, Z., Savarese, S., Savva, M., Song, S., Su, H., et al.: Shapenet: An information-rich 3d model repository. arXiv preprint arXiv:1512.03012 (2015)

[4] Chen, J., Gao, D., Lin, K.Q., Shou, M.Z.: Affordance grounding from demonstration video to target image. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 6799–6808 (2023)

[5] Chi, C., Xu, Z., Feng, S., Cousineau, E., Du, Y., Burchfiel, B., Tedrake, R., Song, S.: Diffusion policy: Visuomotor policy learning via action diffusion. The International Journal of Robotics Research 44(10-11), 1684–1704 (2025)

[6] Davidson, J.K., Hunt, K.H., Pennock, G.R.: Robots and screw theory: applications of kinematics and statics to robotics. J. Mech. Des. 126(4), 763–764 (2004)

[7] Deng, S., Xu, X., Wu, C., Chen, K., Jia, K.: 3d affordancenet: A benchmark for visual object affordance understanding. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 1778–1787 (2021)

[8] Do, T.T., Nguyen, A., Reid, I.: Affordancenet: An end-to-end deep learning approach for object affordance detection. In: 2018 IEEE international conference on robotics and automation (ICRA). pp. 5882–5889. IEEE (2018)

[9] Eisner, B., Zhang, H., Held, D.: Flowbot3d: Learning 3d articulation flow to manipulate articulated objects. arXiv preprint arXiv:2205.04382 (2022)

[10] Geng, Y., An, B., Geng, H., Chen, Y., Yang, Y., Dong, H.: End-to-end affordance learning for robotic manipulation. arXiv preprint arXiv:2209.12941 (2022)

[11] He, Y., Lu, H., Wang, D., Li, S., Li, Z., Liu, Y., Zhao, J., Ruan, S.: Vision-language-action models: Current developments and frontier advances. Journal of Image and Graphics 31(6), 1911–1941 (2026)

[12] Huang, W., Wang, C., Li, Y., Zhang, R., Fei-Fei, L.: Rekep: Spatio-temporal reasoning of relational keypoint constraints for robotic manipulation (2024), https://arxiv.org/abs/2409.01652

[13] Huang, W., Wang, C., Zhang, R., Li, Y., Wu, J., Fei-Fei, L.: Voxposer: Composable 3d value maps for robotic manipulation with language models. arXiv preprint arXiv:2307.05973 (2023)

[14] Jiang, Q., Li, F., Zeng, Z., Ren, T., Liu, S., Zhang, L.: T-rex2: Towards generic object detection via text-visual prompt synergy (2024)

[15] Kanazawa, A., Tulsiani, S., Efros, A.A., Malik, J.: Learning category-specific mesh reconstruction from image collections. In: Proceedings of the European conference on computer vision (ECCV). pp. 371–386 (2018)

[16] Katz, D., Orthey, A., Brock, O.: Interactive perception of articulated objects. In: Experimental Robotics: The 12th International Symposium on Experimental Robotics. pp. 301–315. Springer (2014)

[17] Kim, M.J., Pertsch, K., Karamcheti, S., Xiao, T., Balakrishna, A., Nair, S., Rafailov, R., Foster, E., Lam, G., Sanketi, P., et al.: Openvla: An open-source vision-language-action model. arXiv preprint arXiv:2406.09246 (2024)

[18] Kuffner, J., LaValle, S.: Rrt-connect: An efficient approach to single-query path planning. In: Proceedings 2000 ICRA. Millennium Conference. IEEE International Conference on Robotics and Automation. Symposia Proceedings (Cat. No.00CH37065). vol. 2, pp. 995–1001 (2000). https://doi.org/10.1109/ROBOT.2000.844730

[19] Li, G., Jampani, V., Sun, D., Sevilla-Lara, L.: Locate: Localize and transfer object parts for weakly supervised affordance grounding. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 10922–10931 (2023)

[20] Li, X., Zhang, M., Geng, Y., Geng, H., Long, Y., Shen, Y., Zhang, R., Liu, J., Dong, H.: Manipllm: Embodied multimodal large language model for object-centric robotic manipulation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 18061–18070 (2024)

[21] Ling, S., Wang, Y., Wu, S., Zhuang, Y., Xu, T., Li, Y., Liu, C., Dong, H.: Articulated object manipulation with coarse-to-fine affordance for mitigating the effect of point cloud noise (2024), https://arxiv.org/abs/2402.18699

[22] Liu, F., Fang, K., Abbeel, P., Levine, S.: Moka: Open-world robotic manipulation through mark-based visual prompting. arXiv preprint arXiv:2403.03174 (2024)

[23] Liu, J., Liu, M., Wang, Z., An, P., Li, X., Zhou, K., Yang, S., Zhang, R., Guo, Y., Zhang, S.: Robomamba: Efficient vision-language-action model for robotic reasoning and manipulation. Advances in Neural Information Processing Systems 37, 40085–40110 (2024)

[24] Liu, J., Liu, M., Wang, Z., An, P., Li, X., Zhou, K., Yang, S., Zhang, R., Guo, Y., Zhang, S.: Robomamba: Efficient vision-language-action model for robotic reasoning and manipulation. In: Globerson, A., Mackey, L., Belgrave, D., Fan, A., Paquet, U., Tomczak, J., Zhang, C. (eds.) Advances in Neural Information Processing Systems. vol. 37, pp. 40085–40110....

[25] Liu, L., Xu, W., Fu, H., Qian, S., Yu, Q., Han, Y., Lu, C.: Akb-48: A real-world articulated object knowledge base. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 14809–14818 (2022)

[26] Liu, S., Wu, L., Li, B., Tan, H., Chen, H., Wang, Z., Xu, K., Su, H., Zhu, J.: Rdt-1b: a diffusion foundation model for bimanual manipulation. arXiv preprint arXiv:2410.07864 (2024)

[27] Lu, D., Kong, L., Huang, T., Lee, G.H.: Geal: Generalizable 3d affordance learning with cross-modal consistency (2024), https://arxiv.org/abs/2412.09511

[28] Lu, L., Zhai, W., Luo, H., Kang, Y., Cao, Y.: Phrase-based affordance detection via cyclic bilateral interaction. IEEE Transactions on Artificial Intelligence 4(5), 1186–1198 (2022)

[29] Mahalingam, D., Chakraborty, N.: Human-guided planning for complex manipulation tasks using the screw geometry of motion. arXiv preprint arXiv:2209.05672 (2022)

[30] Mo, K., Guibas, L.J., Mukadam, M., Gupta, A., Tulsiani, S.: Where2act: From pixels to actions for articulated 3d objects. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 6813–6823 (2021)

[31] Mo, K., Zhu, S., Chang, A.X., Yi, L., Tripathi, S., Guibas, L.J., Su, H.: Partnet: A large-scale benchmark for fine-grained and hierarchical part-level 3d object understanding. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 909–918 (2019)

[32] Mu, Y., Zhao, H., Hu, R., Zhang, L., Li, H., Yang, J., Wang, J., Han, L., Su, Y., Xu, K., Yang, Y., Li, J., Dai, R., Chen, B., Liu, Y., Yi, L.: Frontiers and prospects of embodied ai: Evolution of data, models, and systems. Journal of Image and Graphics 31(6), 2017–2025 (2026). https://doi.org/10.11834/jig.260059

[33] Nguyen, T., Vu, M.N., Huang, B., Van Vo, T., Truong, V., Le, N., Vo, T., Le, B., Nguyen, A.: Language-conditioned affordance-pose detection in 3d point clouds. In: 2024 IEEE International Conference on Robotics and Automation (ICRA). pp. 3071–3078. IEEE (2024)

[34] Nguyen, T., Vu, M.N., Vuong, A., Nguyen, D., Vo, T., Le, N., Nguyen, A.: Open-vocabulary affordance detection in 3d point clouds. In: 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). pp. 5692–5698. IEEE (2023)

[35] Pan, M., Zhang, J., Wu, T., Zhao, Y., Gao, W., Dong, H.: Omnimanip: Towards general robotic manipulation via object-centric interaction primitives as spatial constraints (2025), https://arxiv.org/abs/2501.03841

[36] Qin, Y., Chen, R., Zhu, H., Song, M., Xu, J., Su, H.: S4g: Amodal single-view single-shot se(3) grasp detection in cluttered scenes. In: Conference on robot learning. pp. 53–65. PMLR (2020)

[37] Reimers, N., Gurevych, I.: Sentence-bert: Sentence embeddings using siamese bert-networks. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics (11 2019), https://arxiv.org/abs/1908.10084

[38] Ren, T., Jiang, Q., Liu, S., Zeng, Z., Liu, W., Gao, H., Huang, H., Ma, Z., Jiang, X., Chen, Y., Xiong, Y., Zhang, H., Li, F., Tang, P., Yu, K., Zhang, L.: Grounding dino 1.5: Advance the "edge" of open-set object detection (2024), https://arxiv.org/abs/2405.10300

[39] Song, S., Zeng, A., Lee, J., Funkhouser, T.: Grasping in the wild: Learning 6dof closed-loop grasping from low-cost demonstrations. IEEE Robotics and Automation Letters 5(3), 4978–4985 (2020)

[40] Tang, Y., Huang, W., Wang, Y., Li, C., Yuan, R., Zhang, R., Wu, J., Fei-Fei, L.: Uad: Unsupervised affordance distillation for generalization in robotic manipulation (2025), https://arxiv.org/abs/2506.09284

[41] Team, O.M., Ghosh, D., Walke, H., Pertsch, K., Black, K., Mees, O., Dasari, S., Hejna, J., Kreiman, T., Xu, C., et al.: Octo: An open-source generalist robot policy. arXiv preprint arXiv:2405.12213 (2024)

[42] Wang, J., Dasari, S., Srirama, M.K., Tulsiani, S., Gupta, A.: Manipulate by seeing: Creating manipulation controllers from pre-trained representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 3859–3868 (2023)

[43] Wang, Y., Zhang, M., Li, Z., Kelestemur, T., Driggs-Campbell, K., Wu, J., Fei-Fei, L., Li, Y.: D3 Fields: Dynamic 3d descriptor fields for zero-shot generalizable rearrangement. arXiv preprint arXiv:2309.16118 (2023)

[44] Wang, Y., Zhang, X., Wu, R., Li, Y., Shen, Y., Wu, M., He, Z., Wang, Y., Dong, H.: Adamanip: Adaptive articulated object manipulation environments and policy learning. arXiv preprint arXiv:2502.11124 (2025)

[45] Wen, B., Tremblay, J., Blukis, V., Tyree, S., Müller, T., Evans, A., Fox, D., Kautz, J., Birchfield, S.: BundleSDF: Neural 6-DoF tracking and 3D reconstruction of unknown objects. In: CVPR (2023)

[46] Wen, B., Yang, W., Kautz, J., Birchfield, S.: FoundationPose: Unified 6d pose estimation and tracking of novel objects. In: CVPR (2024)

[47] Wu, R., Zhao, Y., Mo, K., Guo, Z., Wang, Y., Wu, T., Fan, Q., Chen, X., Guibas, L., Dong, H.: Vat-mart: Learning visual action trajectory proposals for manipulating 3d articulated objects. arXiv preprint arXiv:2106.14440 (2021)

[48] Xiang, F., Qin, Y., Mo, K., Xia, Y., Zhu, H., Liu, F., Liu, M., Jiang, H., Yuan, Y., Wang, H., et al.: Sapien: A simulated part-based interactive environment. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 11097–11107 (2020)

[49] Yang, X., Gong, X.: Foundation model assisted weakly supervised semantic segmentation. In: Proceedings of the IEEE/CVF winter conference on applications of computer vision. pp. 523–532 (2024)

[50] Ze, Y., Zhang, G., Zhang, K., Hu, C., Wang, M., Xu, H.: 3d diffusion policy: Generalizable visuomotor policy learning via simple 3d representations. arXiv preprint arXiv:2403.03954 (2024)

[51] Zhang, H., Eisner, B., Held, D.: Flowbot++: Learning generalized articulated objects manipulation via articulation projection. arXiv preprint arXiv:2306.12893 (2023)

[52] Zhang, X., Wang, Y., Wu, R., Xu, K., Li, Y., Xiang, L., Dong, H., He, Z.: Adaptive articulated object manipulation on the fly with foundation model reasoning and part grounding. arXiv preprint arXiv:2507.18276 (2025)