arxiv: 2607.02045 · v1 · pith:DXPFVZ4Bnew · submitted 2026-07-02 · 💻 cs.CV

PWM-ArtGen: Part World Model for Articulated Object Generation

Pith reviewed 2026-07-03 15:58 UTC · model grok-4.3

classification 💻 cs.CV
keywords: articulated object generation, kinematic structure, part world model, diffusion models, zero-shot generalization, image diffusion, action diffusion, 3D object generation

The pith

Coupling action diffusion with image diffusion recovers kinematic structure from single images of articulated objects.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to generate articulated 3D objects from a single image by recovering the underlying kinematic structure more accurately than prior approaches. Existing methods either rely on static images that miss dynamic part relationships or use sequential estimation steps that accumulate errors, while small annotated datasets limit generalization. The authors introduce a Part World Model that learns the joint distribution of visual dynamics and kinematic parameters by coupling action diffusion and image diffusion with independent timesteps, allowing co-training on unannotated data. They support this with a new photorealistic dataset of 19.7k part-level image pairs. If correct, the approach would produce better resting-state outputs and stronger zero-shot performance on out-of-distribution objects.

Core claim

Articulated objects can be treated as dynamic systems whose visual dynamics and kinematic parameters are learned jointly; a unified Part World Model couples action diffusion and image diffusion using independent timesteps to enable co-training on unannotated data, yielding substantially better kinematic recovery than baselines or two-step pipelines.
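
The contrast between the joint model and a two-step pipeline can be written out as a short factorization. This is a sketch rather than the paper's own derivation: the symbols o, o', a, and p^ref follow the figure captions, while the point estimate \hat{o}' and the explicit integral are notation introduced here for illustration.

```latex
% Two-step pipeline: generate the dynamics first, then estimate parameters
% from the generated (possibly wrong) image \hat{o}'.
\hat{o}' \sim p(o' \mid o, p^{\mathrm{ref}}), \qquad
\hat{a} \sim p(a \mid o, \hat{o}', p^{\mathrm{ref}})

% Joint model: one distribution over dynamics and parameters.
p(o', a \mid o, p^{\mathrm{ref}})
  = p(o' \mid o, p^{\mathrm{ref}})\; p(a \mid o, o', p^{\mathrm{ref}})

% Kinematic parameters are then a marginal of the joint, not a function of
% a single generated sample:
p(a \mid o, p^{\mathrm{ref}})
  = \int p(o', a \mid o, p^{\mathrm{ref}})\, \mathrm{d}o'
```

In the two-step form any error in \hat{o}' is passed on to \hat{a}; in the joint form the two quantities are sampled together, which is the property the core claim relies on.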

What carries the argument

Part World Model (PWM-ArtGen) that couples action diffusion and image diffusion with independent diffusion timesteps to support visual-branch co-training on unannotated image pairs.
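
A minimal sketch of what independent timesteps could look like when building one training example. It assumes a DDPM-style linear noise schedule and treats images and actions as flat lists of floats. The handling of unannotated pairs (action input set to pure noise at the maximum timestep, action loss masked out) is an assumption borrowed from unified world models [47], not a detail confirmed on this page. `T`, `ACTION_DIM`, and all function names are hypothetical.

```python
import math
import random

T = 1000          # number of diffusion steps (assumed)
ACTION_DIM = 7    # hypothetical size of the kinematic parameter vector


def alpha_bar(t, beta_start=1e-4, beta_end=0.02):
    """Cumulative signal level after t steps of a linear beta schedule."""
    prod = 1.0
    for s in range(t):
        beta = beta_start + (beta_end - beta_start) * s / (T - 1)
        prod *= 1.0 - beta
    return prod


def add_noise(x, t, rng):
    """DDPM forward process: x_t = sqrt(ab) * x + sqrt(1 - ab) * eps."""
    ab = alpha_bar(t)
    eps = [rng.gauss(0.0, 1.0) for _ in x]
    x_t = [math.sqrt(ab) * xi + math.sqrt(1.0 - ab) * ei
           for xi, ei in zip(x, eps)]
    return x_t, eps


def make_training_example(target_image, action, rng):
    """Noise the target image and the action with separate timesteps.

    `action` is None for an unannotated image pair. In that case the action
    input is pure noise at t_a = T and the action loss is masked
    (assumption, see lead-in).
    """
    t_o = rng.randint(1, T)                      # image timestep
    noisy_image, eps_o = add_noise(target_image, t_o, rng)
    if action is None:
        t_a = T
        noisy_action = [rng.gauss(0.0, 1.0) for _ in range(ACTION_DIM)]
        eps_a, action_mask = None, 0.0
    else:
        t_a = rng.randint(1, T)                  # sampled independently of t_o
        noisy_action, eps_a = add_noise(action, t_a, rng)
        action_mask = 1.0
    return {"t_o": t_o, "t_a": t_a, "noisy_image": noisy_image,
            "noisy_action": noisy_action, "eps_o": eps_o, "eps_a": eps_a,
            "action_mask": action_mask}


def mse(pred, target):
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(target)


def training_loss(example, pred_eps_o, pred_eps_a):
    """Image noise loss always applies; action noise loss only when annotated."""
    loss = mse(pred_eps_o, example["eps_o"])
    if example["action_mask"] > 0.0:
        loss += mse(pred_eps_a, example["eps_a"])
    return loss
```

Because the two timesteps are drawn separately, the same network sees every combination of clean and noisy inputs across modalities, so an image pair without kinematic labels can still contribute a gradient through the image branch.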

If this is right

  • Kinematic parameters can be recovered directly from the joint distribution rather than from static images or sequential estimation.
  • Unannotated photorealistic image pairs become usable for training through independent-timestep co-training.
  • Performance improves in the resting state relative to existing baselines.
  • Zero-shot generalization extends to out-of-distribution articulated objects.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same coupling mechanism could be tested on video prediction tasks where both appearance and motion parameters must be inferred together.
  • If the independent-timestep design scales, it might reduce reliance on fully annotated kinematic datasets across other 3D generation domains.
  • The approach suggests that treating generation as joint modeling of dynamics and parameters may apply to non-articulated deformable objects as well.

Load-bearing premise

Coupling action diffusion and image diffusion with independent timesteps on unannotated data will recover accurate kinematic structure without the accumulated errors of two-step methods.

What would settle it

On a held-out test set of articulated objects with known ground-truth joints and part motions, the model produces incorrect part trajectories or joint parameters at rates no better than two-step baselines.
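
One way such a test could be scored, as a sketch only. The metric here (angle between predicted and ground-truth joint axes, ignoring sign) and the 10-degree failure threshold are illustrative assumptions; the page does not state which metrics or thresholds the paper uses. Axes are assumed to be non-zero 3-vectors.

```python
import math


def axis_angle_error_deg(pred_axis, gt_axis):
    """Angle in degrees between two joint axes, treating v and -v as equal."""
    dot = sum(p * g for p, g in zip(pred_axis, gt_axis))
    norm = (math.sqrt(sum(p * p for p in pred_axis))
            * math.sqrt(sum(g * g for g in gt_axis)))
    cos = min(1.0, abs(dot) / norm)   # clamp against rounding above 1
    return math.degrees(math.acos(cos))


def failure_rate(pred_axes, gt_axes, threshold_deg=10.0):
    """Fraction of joints whose axis error exceeds the (hypothetical) threshold."""
    errors = [axis_angle_error_deg(p, g) for p, g in zip(pred_axes, gt_axes)]
    return sum(e > threshold_deg for e in errors) / len(errors)


# Toy comparison on two joints with a vertical ground-truth axis.
gt = [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]
method_a = [[0.0, 0.0, 1.0], [1.0, 0.0, 0.0]]    # second joint is 90 degrees off
method_b = [[0.0, 0.0, -1.0], [0.0, 0.05, 1.0]]  # both within a few degrees
print(failure_rate(method_a, gt), failure_rate(method_b, gt))
```

The claim would be in trouble if, on held-out objects, the joint model's failure rate under a metric of this kind were not lower than that of a two-step baseline.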

Figures

Figures reproduced from arXiv: 2607.02045 by Ancong Wu, Wentao Zheng.

Figure 1: Overview of PWM-ArtGen, an end-to-end pipeline to convert a single image into an articulated 3D object: (1) Parts mask generator: from image o, a pretrained module based on SAM [17] and GPT-4o [14] extracts N part masks {m_i}_{i=1}^N and an articulate graph for assembly, (2) Part-level inference: p_i^ref is instantiated as the part mask m_i, sample (o'_i, a_i) ∼ p(o', a | o, p_i^ref), (3) Assembly: align …
Figure 2: A single Part World Model (PWM-ArtGen) block consists of a transformer block with observation, a target articulated part, and diffusion timesteps conditioning via adaptive layer norm. … predicts the actions and observation noises ε_a and ε_o', conditioned on static observations o and the reference part p^ref, where o is the input image and p^ref specifies the target articulated part. In accordance with the …
Figure 3: Construction pipeline of PartNet-Mobility-Reality (PM-R). Synthetic renderings from PartNet-Mobility are enhanced into photorealistic counterparts through the Qwen-Image-Edit-2509 [40] model guided by a structured Prompt Library, producing 19.7k photorealistic samples. To enhance the model's generalization ability on real-world images, we construct the PM-R dataset as illustrated in …
Figure 4: Qualitative comparison on the ACD dataset under zero-shot evaluation. From left to right: input image, ground truth (GT), our method, SINGAPO, and Articulate-Anything. Each row shows the predicted part layout and motion axes in both resting and articulated states. The last four rows highlight that other methods' predictions exhibit spatial misalignment with GT in part positioning and joint orientation. In …
Figure 5: Qualitative comparison of interactive dynamics generation.
Figure 6: Overview of the Part Mask Generator pipeline.
Figure 7: First-step prompt for GPT-4o: Abstract graph generation.
Figure 8: Second-step prompt for GPT-4o: Mask ID assignment.
Figure 9: Sensitivity analysis of the Co-T weight on the ACD dataset. B.2 Justification for Baseline Selection and Fair Comparison: As shown in Tab. 6, the released checkpoint of Singapo trained on a much larger dataset of 3,063 objects significantly outperforms variants retrained on our limited PartNet-Mobility datasets. This performance gap confirms that Singapo, …
Figure 10: Qualitative comparison on the PartNet-Mobility dataset. From left to right: input image, ground truth (GT), our method, SINGAPO, and Articulate-Anything. Each row shows the predicted part layout and motion axes in both resting and articulated states. The last row highlights that SINGAPO's predictions exhibit spatial misalignment with GT in part positioning and joint orientation. In contrast, our method …
Figure 11: Additional qualitative results on the ACD test dataset and in-the-wild or internet-sourced images. First two rows: Successful reconstructions from ACD test inputs, demonstrating robust part-level articulation recovery under complex textures, occlusions, and lighting variations. Third row: Failure cases, typically occurring when input objects exhibit base geometries or articulation patterns outside the tra…
Original abstract

The key challenge in articulated 3D object generation from a single image is accurately predicting the underlying kinematic structure. Existing methods either infer kinematic parameters directly from a static image that lacks dynamic part-level kinematic relationships, or estimate parameters from visual dynamics generated from a single image, which is prone to accumulated errors of two steps. Moreover, the limited scale and diversity of existing annotated datasets further hinder generalization to complex, real-world objects. To overcome these limitations, we propose to learn the joint distribution of visual dynamics and kinematic parameters. Recognizing that articulated objects can be formulated as dynamic systems, we propose a unified Part World Model called PWM-ArtGen. To leverage unannotated data, this model couples action diffusion and image diffusion with independent diffusion timesteps, which enables visual branch co-training. We further curate a photorealistic dataset of 19.7k part-level image pairs without kinematic annotations, to support co-training. Experiments demonstrate that PWM-ArtGen substantially outperforms existing baselines in the resting state and exhibits strong zero-shot generalization to out-of-distribution objects.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper proposes PWM-ArtGen, a Part World Model for articulated 3D object generation from a single image. It addresses limitations of prior methods by learning the joint distribution of visual dynamics and kinematic parameters via a unified model that couples action diffusion and image diffusion using independent timesteps, enabling co-training on a newly curated photorealistic dataset of 19.7k part-level image pairs without kinematic annotations. The central claims are that this yields substantially better performance than baselines in the resting state and strong zero-shot generalization to out-of-distribution objects.

Significance. If the joint-training approach on unannotated data reliably recovers kinematic structure, the work could meaningfully advance single-image articulated generation by sidestepping accumulated errors from two-step pipelines and scaling beyond small annotated datasets. The use of independent-timestep diffusion for co-training is a potentially useful technical device for leveraging larger unannotated corpora.

major comments (2)
  1. [Abstract] The central claim that coupling action diffusion and image diffusion with independent timesteps on unannotated pairs recovers kinematic parameters more reliably than two-step methods is load-bearing for both the performance and zero-shot OOD generalization assertions, yet the description provides no mechanism ensuring that predicted actions are linked to observed image dynamics during co-training; without such linkage the model could fit spurious correlations.
  2. [Abstract] The experiments are said to demonstrate substantial outperformance and strong zero-shot generalization, but no quantitative tables, ablation studies, dataset statistics (e.g., diversity metrics for the 19.7k pairs), or evaluation protocols are supplied, preventing verification that the claimed gains are supported by the data rather than by the unannotated training regime.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We respond point-by-point to the two major comments below, drawing on details from the full manuscript while noting where the abstract can be clarified.

  1. Referee: [Abstract] The central claim that coupling action diffusion and image diffusion with independent timesteps on unannotated pairs recovers kinematic parameters more reliably than two-step methods is load-bearing for both the performance and zero-shot OOD generalization assertions, yet the description provides no mechanism ensuring that predicted actions are linked to observed image dynamics during co-training; without such linkage the model could fit spurious correlations.

    Authors: The abstract is concise by design. Section 3 of the manuscript specifies the linkage: the action diffusion branch is conditioned on the source image and the target image from each training pair (with independent timesteps), and the training objective requires that the predicted action, when applied through the dynamics, reconstructs the observed target image via the image diffusion branch. This direct consistency constraint on paired data prevents fitting to spurious correlations. We will revise the abstract to include a short clause describing this conditioning.
    Revision: yes

  2. Referee: [Abstract] The experiments are said to demonstrate substantial outperformance and strong zero-shot generalization, but no quantitative tables, ablation studies, dataset statistics (e.g., diversity metrics for the 19.7k pairs), or evaluation protocols are supplied, preventing verification that the claimed gains are supported by the data rather than by the unannotated training regime.

    Authors: Abstracts summarize rather than tabulate. The full manuscript contains quantitative comparisons in Tables 1 and 2 (Section 4), ablations in Section 4.3, dataset statistics and diversity metrics for the 19.7k pairs in Section 3.2, and evaluation protocols in Section 4.1. These results support the performance and zero-shot claims. We can expand the abstract's results sentence or add a pointer if the referee prefers.
    Revision: partial

Circularity Check

0 steps flagged

No circularity; claims rest on data-driven learning from curated pairs

The paper frames its core contribution as learning the joint distribution of visual dynamics and kinematic parameters via coupled diffusion models trained on a 19.7k unannotated dataset. No equations, fitted parameters renamed as predictions, or self-citations appear in the provided text that would reduce any claimed result to an input by construction. The independent-timestep coupling and zero-shot generalization are presented as empirical outcomes of co-training rather than tautological definitions or self-referential fits. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that articulated objects behave as dynamic systems whose visual and kinematic distributions can be jointly modeled by diffusion; no free parameters or invented entities are named in the abstract.

axioms (1)
  • Domain assumption: Articulated objects can be formulated as dynamic systems.
    Explicitly invoked to justify the Part World Model construction.
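
To make the dynamic-system reading concrete, the sketch below treats a part's geometry as the state and a joint motion as the action, so that the next state is a function of both. It is an illustration of the axiom only: the action dictionary layout (`type`, `axis`, `pivot`, `amount`) is hypothetical, and the page does not specify how the paper parameterizes actions. The rotation uses the standard Rodrigues formula.

```python
import math


def rotate_about_axis(point, pivot, axis, angle_rad):
    """Rotate `point` about the line through `pivot` with direction `axis`."""
    norm = math.sqrt(sum(a * a for a in axis))
    k = [a / norm for a in axis]                      # unit axis
    v = [p - c for p, c in zip(point, pivot)]         # point relative to pivot
    cos_t, sin_t = math.cos(angle_rad), math.sin(angle_rad)
    k_cross_v = [k[1] * v[2] - k[2] * v[1],
                 k[2] * v[0] - k[0] * v[2],
                 k[0] * v[1] - k[1] * v[0]]
    k_dot_v = sum(ki * vi for ki, vi in zip(k, v))
    rotated = [v[i] * cos_t + k_cross_v[i] * sin_t + k[i] * k_dot_v * (1.0 - cos_t)
               for i in range(3)]
    return [r + c for r, c in zip(rotated, pivot)]


def step(part_points, action):
    """State transition: apply one joint action to every point of a part."""
    if action["type"] == "revolute":                  # e.g. a door on a hinge
        return [rotate_about_axis(p, action["pivot"], action["axis"], action["amount"])
                for p in part_points]
    if action["type"] == "prismatic":                 # e.g. a sliding drawer
        norm = math.sqrt(sum(a * a for a in action["axis"]))
        shift = [action["amount"] * a / norm for a in action["axis"]]
        return [[pi + si for pi, si in zip(p, shift)] for p in part_points]
    raise ValueError("unknown joint type: " + str(action["type"]))


# A door corner at (1, 0, 0) swinging 90 degrees about a vertical hinge at the origin.
door = [[1.0, 0.0, 0.0]]
hinge = {"type": "revolute", "axis": [0.0, 0.0, 1.0],
         "pivot": [0.0, 0.0, 0.0], "amount": math.pi / 2}
print(step(door, hinge))
```

Under this view, the resting image corresponds to the current state, the articulated image to the next state, and the kinematic parameters to the action that connects them.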

pith-pipeline@v0.9.1-grok · 5708 in / 1250 out tokens · 25335 ms · 2026-07-03T15:58:43.810403+00:00 · methodology

Reference graph

Works this paper leans on

52 extracted references · 8 canonical work pages · 4 internal anchors

  1. [1]

    Cosmos World Foundation Model Platform for Physical AI

    Agarwal, N., Ali, A., Bala, M., Balaji, Y., Barker, E., Cai, T., Chattopadhyay, P., Chen, Y., Cui, Y., Ding, Y., et al.: Cosmos world foundation model platform for physical ai. arXiv preprint arXiv:2501.03575 (2025)

  2. [2]

    Revisiting Feature Prediction for Learning Visual Representations from Video

    Bardes, A., Garrido, Q., Ponce, J., Chen, X., Rabbat, M., LeCun, Y., Assran, M., Ballas, N.: Revisiting feature prediction for learning visual representations from video. arXiv preprint arXiv:2404.08471 (2024)

  3. [3]

    Brooks, T., Peebles, B., Holmes, C., DePue, W., Guo, Y., Jing, L., Schnurr, D., Taylor, J., Luhman, T., Luhman, E., Ng, C., Wang, R., Ramesh, A.: Video generation models as world simulators (2024), https://openai.com/research/video-generation-models-as-world-simulators, accessed: 29 June 2026

  4. [4]

    Bruce, J., Dennis, M., Edwards, A., Parker-Holder, J., Shi, Y.J., Hughes, E., Lai, M., Mavalankar, A., Steigerwald, R., Apps, C., Aytar, Y., Bechtle, S., Behbahani, F., Chan, S., Heess, N., Gonzalez, L., Osindero, S., Ozair, S., Reed, S., Zhang, J., Zolna, K., Clune, J., De Freitas, N., Singh, S., Rocktäschel, T.: Genie: generative interactive environment...

  5. [5]

    Chen, C., Liu, I., Wei, X., Su, H., Liu, M.: Freeart3d: Training-free articulated object generation using 3d diffusion. In: Proceedings of the SIGGRAPH Asia 2025 Conference Papers (2025)

  6. [6]

    Chen, H., Lan, Y., Chen, Y., Pan, X.: Artilatent: Realistic articulated 3d object generation via structured latents. In: Proceedings of the SIGGRAPH Asia 2025 Conference Papers (2025)

  7. [7]

    Chen, Z., Walsman, A., Memmel, M., Mo, K., Fang, A., Vemuri, K., Wu, A., Fox, D., Gupta, A.: Urdformer: A pipeline for constructing articulated simulation environments from real-world images. In: Proceedings of Robotics: Science and Systems (2024)

  8. [8]

    Collins, J., Goel, S., Deng, K., Luthra, A., Xu, L., Gundogdu, E., Zhang, X., Vicente, T.F.Y., Dideriksen, T., Arora, H., Guillaumin, M., Malik, J.: Abo: Dataset and benchmarks for real-world 3d object understanding. In: IEEE Conf. Comput. Vis. Pattern Recog. pp. 21094–21104 (2022)

  9. [9]

    Geng, H., Xu, H., Zhao, C., Xu, C., Yi, L., Huang, S., Wang, H.: Gapartnet: Cross-category domain-generalizable object perception and manipulation via generalizable and actionable parts. In: IEEE Conf. Comput. Vis. Pattern Recog. pp. 7081–7091 (2023)

  10. [10]

    Guo, Y., Hu, Y., Zhang, J., Wang, Y.J., Chen, X., Lu, C., Chen, J.: Prediction with action: Visual policy learning via joint denoising process. In: Adv. Neural Inform. Process. Syst. vol. 37, pp. 112386–112410 (2024)

  11. [11]

    Heppert, N., Irshad, M.Z., Zakharov, S., Liu, K., Ambrus, R.A., Bohg, J., Valada, A., Kollar, T.: Carto: Category and joint agnostic reconstruction of articulated objects. In: IEEE Conf. Comput. Vis. Pattern Recog. pp. 21201–21210 (2023)

  12. [12]

    Ho, J., Jain, A., Abbeel, P.: Denoising diffusion probabilistic models. In: Adv. Neural Inform. Process. Syst. vol. 33, pp. 6840–6851 (2020)

  13. [13]

    Hu, R., Li, W., Van Kaick, O., Shamir, A., Zhang, H., Huang, H.: Learning to predict part mobility from a single static snapshot. ACM Trans. Graph. 36(6), 1–13 (2017)

  14. [14]

    GPT-4o System Card

    Hurst, A., Lerer, A., Goucher, A.P., Perelman, A., Ramesh, A., Clark, A., Ostrow, A., Welihinda, A., Hayes, A., Radford, A., et al.: Gpt-4o system card. arXiv preprint arXiv:2410.21276 (2024)

  15. [15]

    Iliash, D., Jiang, H., Zhang, Y., Savva, M., Chang, A.X.: S2o: Static to openable enhancement for articulated 3d objects. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. pp. 6785–6795 (2026)

  16. [16]

    Jiang, Z., Hsu, C.C., Zhu, Y.: Ditto: Building digital twins of articulated objects from interaction. In: IEEE Conf. Comput. Vis. Pattern Recog. pp. 5616–5626 (2022)

  17. [17]

    Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.Y., Dollár, P., Girshick, R.: Segment anything. In: Int. Conf. Comput. Vis. (2023)

  18. [18]

    Le, L., Xie, J., Liang, W., Wang, H.J., Yang, Y., Ma, Y.J., Vedder, K., Krishna, A., Jayaraman, D., Eaton, E.: Articulate-anything: Automatic modeling of articulated objects via a vision-language foundation model. In: Int. Conf. Learn. Represent. (2025)

  19. [19]

    Lei, J., Deng, C., Shen, W.B., Guibas, L., Daniilidis, K.: Nap: Neural 3d articulated object prior. In: Adv. Neural Inform. Process. Syst. vol. 36, pp. 31878–31894 (2023)

  20. [20]

    Li, R., Zheng, C., Rupprecht, C., Vedaldi, A.: Dragapart: Learning a part-level motion prior for articulated objects. In: Eur. Conf. Comput. Vis. pp. 165–183. Springer (2024)

  21. [21]

    Li, R., Zheng, C., Rupprecht, C., Vedaldi, A.: Puppet-master: Scaling interactive video generation as a motion prior for part-level dynamics. In: Int. Conf. Comput. Vis. pp. 13405–13415 (2025)

  22. [22]

    Liu, J., Iliash, D., Chang, A., Savva, M., Mahdavi Amiri, A.: Singapo: Single image controlled generation of articulated parts in objects. In: Int. Conf. Learn. Represent. (2025)

  23. [23]

    Liu, J., Mahdavi-Amiri, A., Savva, M.: Paris: Part-level reconstruction and motion analysis for articulated objects. In: Int. Conf. Comput. Vis. pp. 352–363 (2023)

  24. [24]

    In: IEEE Conf

    Liu, J., Tam, H.I.I., Mahdavi-Amiri, A., Savva, M.: Cage: Controllable articulation generation. In: IEEE Conf. Comput. Vis. Pattern Recog. pp. 17880–17889 (2024)

  25. [25]

    Liu, S., Gupta, S., Wang, S.: Building rearticulable models for arbitrary 3d objects from4dpointclouds.In:IEEEConf.Comput.Vis.PatternRecog.pp.21138–21147 (2023)

  26. [26]

    Liu, Y., Jia, B., Lu, R., Gan, C., Chen, H., Ni, J., Zhu, S.C., Huang, S.: Videoartgs: Building digital twins of articulated objects from monocular video. arXiv preprint arXiv:2509.17647 (2025)

  27. [27]

    Liu, Y., Jia, B., Lu, R., Ni, J., Zhu, S.C., Huang, S.: Building interactable replicas of complex articulated objects via gaussian splatting. In: Int. Conf. Learn. Represent. (2025)

  28. [28]

    Lu, R., Liu, Y., Tang, J., Ni, J., Wang, Y., Wan, D., Zeng, G., Chen, Y., Huang, S.: Dreamart: Generating interactable articulated objects from a single image. arXiv preprint arXiv:2507.05763 (2025)

  29. [29]

    Mo, K., Zhu, S., Chang, A.X., Yi, L., Tripathi, S., Guibas, L.J., Su, H.: Partnet: A large-scale benchmark for fine-grained and hierarchical part-level 3d object understanding. In: IEEE Conf. Comput. Vis. Pattern Recog. pp. 909–918 (2019)

  30. [30]

    Mu, J., Qiu, W., Kortylewski, A., Yuille, A., Vasconcelos, N., Wang, X.: A-sdf: Learning disentangled signed distance functions for articulated shape representation. In: Int. Conf. Comput. Vis. pp. 13001–13011 (2021)

  31. [31]

    Oquab, M., Darcet, T., Moutakanni, T., Vo, H., Szafraniec, M., Khalidov, V., Fernandez, P., Haziza, D., Massa, F., El-Nouby, A., et al.: Dinov2: Learning robust visual features without supervision. Trans. Mach. Learn Res. (2024)

  32. [32]

    Peng, W., Lv, J., Lu, C., Savva, M.: itaco: Interactable digital twins of articulated objects from casually captured rgbd videos. In: International Conference on 3D Vision (3DV). pp. 520–531 (2026)

  33. [33]

    Shen, B., Xia, F., Li, C., Martín-Martín, R., Fan, L., Wang, G., Pérez-D'Arpino, C., Buch, S., Srivastava, S., Tchapmi, L., et al.: igibson 1.0: A simulation environment for interactive tasks in large realistic scenes. In: IEEE/RSJ International Conference on Intelligent Robots and Systems. pp. 7520–7527. IEEE (2021)

  34. [34]

    Song, C., Wei, J., Foo, C.S., Lin, G., Liu, F.: Reacto: Reconstructing articulated objects from a single video. In: IEEE Conf. Comput. Vis. Pattern Recog. pp. 5384–5395 (2024)

  35. [35]

    Song, J., Meng, C., Ermon, S.: Denoising diffusion implicit models. In: Int. Conf. Learn. Represent. (2021)

  36. [36]

    Su, J., Feng, Y., Li, Z., Song, J., He, Y., Ren, B., Xu, B.: Artformer: Controllable generation of diverse 3d articulated objects. In: IEEE Conf. Comput. Vis. Pattern Recog. pp. 1894–1904 (2025)

  37. [37]

    Tseng, W.C., Liao, H.J., Yen-Chen, L., Sun, M.: Cla-nerf: Category-level articulated neural radiance field. In: International Conference on Robotics and Automation. pp. 8454–8460 (2022)

  38. [38]

    Wang, X., Zhou, B., Shi, Y., Chen, X., Zhao, Q., Xu, K.: Shape2motion: Joint analysis of motion parts and attributes from 3d shapes. In: IEEE Conf. Comput. Vis. Pattern Recog. pp. 8876–8884 (2019)

  39. [39]

    Weng, Y., Wen, B., Tremblay, J., Blukis, V., Fox, D., Guibas, L., Birchfield, S.: Neural implicit representation for building digital twins of unknown articulated objects. In: IEEE Conf. Comput. Vis. Pattern Recog. pp. 3141–3150 (2024)

  40. [40]

    Qwen-Image Technical Report

    Wu, C., Li, J., Zhou, J., Lin, J., Gao, K., Yan, K., Yin, S.m., Bai, S., Xu, X., Chen, Y., et al.: Qwen-image technical report. arXiv preprint arXiv:2508.02324 (2025)

  41. [41]

    Wu, H., Jing, Y., Cheang, C., Chen, G., Xu, J., Li, X., Liu, M., Li, H., Kong, T.: Unleashing large-scale video generative pre-training for visual robot manipulation. In: Int. Conf. Learn. Represent. (2024)

  42. [42]

    Wu, R., Wang, X., Liu, L., Guo, C.L., Qiu, J., Li, C., Huang, L., Su, Z., Cheng, M.M.: Dipo: Dual-state images controlled articulated object generation powered by diverse data. In: Adv. Neural Inform. Process. Syst. vol. 38, pp. 108665–108689 (2025)

  43. [43]

    Wu, Y., Lin, Y., Lao, W., Lin, Y., Wei, Y.L., Zheng, W.S., Wu, A.: Dexgrasp-zero: A morphology-aligned policy for zero-shot cross-embodiment dexterous grasping. arXiv preprint arXiv:2603.16806 (2026)

  44. [44]

    Xiang, F., Qin, Y., Mo, K., Xia, Y., Zhu, H., Liu, F., Liu, M., Jiang, H., Yuan, Y., Wang, H., Yi, L., Chang, A.X., Guibas, L.J., Su, H.: Sapien: A simulated part-based interactive environment. In: IEEE Conf. Comput. Vis. Pattern Recog. pp. 11094–11104 (2020)

  45. [45]

    Yan, Z., Hu, R., Yan, X., Chen, L., Van Kaick, O., Zhang, H., Huang, H.: Rpm-net: recurrent prediction of motion and parts from point cloud. ACM Trans. Graph. 38(6), 1–15 (2019)

  46. [46]

    Yu, S., Kwak, S., Jang, H., Jeong, J., Huang, J., Shin, J., Xie, S.: Representation alignment for generation: Training diffusion transformers is easier than you think. In: Int. Conf. Learn. Represent. (2025)

  47. [47]

    Zhu, C., Yu, R., Feng, S., Burchfiel, B., Shah, P., Gupta, A.: Unified world models: Coupling video and action diffusion for pretraining on large robotic datasets. In: Proceedings of Robotics: Science and Systems (2025)
