PWM-ArtGen: Part World Model for Articulated Object Generation

Ancong Wu; Wentao Zheng

arxiv: 2607.02045 · v1 · pith:DXPFVZ4Bnew · submitted 2026-07-02 · 💻 cs.CV

Pith reviewed 2026-07-03 15:58 UTC · model grok-4.3

classification 💻 cs.CV

keywords articulated object generation · kinematic structure · part world model · diffusion models · zero-shot generalization · image diffusion · action diffusion · 3D object generation

The pith

Coupling action diffusion with image diffusion recovers kinematic structure from single images of articulated objects.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to generate articulated 3D objects from a single image by recovering the underlying kinematic structure more accurately than prior approaches. Existing methods either rely on static images that miss dynamic part relationships or use sequential estimation steps that accumulate errors, while small annotated datasets limit generalization. The authors introduce a Part World Model that learns the joint distribution of visual dynamics and kinematic parameters by coupling action diffusion and image diffusion with independent timesteps, allowing co-training on unannotated data. They support this with a new photorealistic dataset of 19.7k part-level image pairs. If correct, the approach would produce better resting-state outputs and stronger zero-shot performance on out-of-distribution objects.

Core claim

Articulated objects can be treated as dynamic systems whose visual dynamics and kinematic parameters are learned jointly; a unified Part World Model couples action diffusion and image diffusion using independent timesteps to enable co-training on unannotated data, yielding substantially better kinematic recovery than baselines or two-step pipelines.

What carries the argument

Part World Model (PWM-ArtGen) that couples action diffusion and image diffusion with independent diffusion timesteps to support visual-branch co-training on unannotated image pairs.
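To make the independent-timestep idea concrete, here is a minimal Python sketch. It assumes a DDPM-style forward process with a cosine schedule, and it assumes that an unannotated image pair is handled by fixing the action timestep at the maximum noise level and zeroing the action loss, in the spirit of unified-world-model training. The page does not spell out the paper's actual loss, so `T`, `ACTION_DIM`, `alpha_bar`, `add_noise`, and `make_training_example` are all hypothetical names and values.

```python
import math
import random

T = 1000         # number of diffusion steps (assumed value)
ACTION_DIM = 7   # size of the kinematic-parameter vector (assumed value)

def alpha_bar(t):
    """Cumulative signal level: 1 at t=0 and close to 0 at t=T."""
    return math.cos(0.5 * math.pi * t / T) ** 2

def add_noise(x0, t, rng):
    """Sample x_t ~ q(x_t | x_0) elementwise; returns (x_t, eps)."""
    a = alpha_bar(t)
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [math.sqrt(a) * x + math.sqrt(1.0 - a) * e for x, e in zip(x0, eps)]
    return xt, eps

def make_training_example(next_obs, action, rng):
    """Noise the image and action targets with independently drawn timesteps.

    Pass action=None for an unannotated pair: the action input is then pure
    noise at t_a = T and only the image branch would contribute to the loss.
    """
    t_o = rng.randint(1, T)
    noisy_obs, eps_o = add_noise(next_obs, t_o, rng)
    if action is None:
        t_a = T
        noisy_act = [rng.gauss(0.0, 1.0) for _ in range(ACTION_DIM)]
        eps_a, action_weight = None, 0.0
    else:
        t_a = rng.randint(1, T)  # drawn separately from t_o
        noisy_act, eps_a = add_noise(action, t_a, rng)
        action_weight = 1.0
    return {
        "t_o": t_o, "t_a": t_a,
        "noisy_obs": noisy_obs, "eps_o": eps_o,
        "noisy_act": noisy_act, "eps_a": eps_a,
        "action_weight": action_weight,
    }

rng = random.Random(0)
labeled = make_training_example([0.2] * 16, [0.1] * ACTION_DIM, rng)
unlabeled = make_training_example([0.2] * 16, None, rng)
```

Under these assumptions, a labeled pair supervises both noise predictions, while an unlabeled pair supervises only the image noise, which is one way the visual branch could be co-trained on data without kinematic annotations.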

If this is right

Kinematic parameters can be recovered directly from the joint distribution rather than from static images or sequential estimation.
Unannotated photorealistic image pairs become usable for training through independent-timestep co-training.
Performance improves in the resting state relative to existing baselines.
Zero-shot generalization extends to out-of-distribution articulated objects.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same coupling mechanism could be tested on video prediction tasks where both appearance and motion parameters must be inferred together.
If the independent-timestep design scales, it might reduce reliance on fully annotated kinematic datasets across other 3D generation domains.
The approach suggests that treating generation as joint modeling of dynamics and parameters may apply to non-articulated deformable objects as well.

Load-bearing premise

Coupling action diffusion and image diffusion with independent timesteps on unannotated data will recover accurate kinematic structure without the accumulated errors of two-step methods.

What would settle it

On a held-out test set of articulated objects with known ground-truth joints and part motions, the model produces incorrect part trajectories or joint parameters at rates no better than two-step baselines.
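One way to run such a test is to score predicted joint axes against ground truth and compare failure rates across methods. The Python sketch below assumes a sign-agnostic angular error in degrees and a hypothetical 10-degree failure threshold; neither is taken from the paper, whose evaluation protocol is not reproduced on this page.

```python
import math

def axis_error_deg(pred, gt):
    """Angle in degrees between two joint axes, ignoring axis direction sign."""
    dot = sum(p * g for p, g in zip(pred, gt))
    norm = math.sqrt(sum(p * p for p in pred)) * math.sqrt(sum(g * g for g in gt))
    cos = min(1.0, abs(dot) / norm)  # clamp against rounding above 1
    return math.degrees(math.acos(cos))

def failure_rate(preds, gts, threshold_deg=10.0):
    """Fraction of joints whose axis error exceeds the (assumed) threshold."""
    errors = [axis_error_deg(p, g) for p, g in zip(preds, gts)]
    return sum(e > threshold_deg for e in errors) / len(errors)
```

The claim would be in trouble if `failure_rate` for the joint model were no lower than the same quantity computed for a two-step baseline on the same held-out objects.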

Figures

Figures reproduced from arXiv: 2607.02045 by Ancong Wu, Wentao Zheng.

**Figure 1.** Overview of PWM-ArtGen, an end-to-end pipeline to convert a single image into an articulated 3D object: (1) Parts mask generator: from image $o$, a pretrained module based on SAM [17] and GPT-4o [14] extracts $N$ part masks $\{m_i\}_{i=1}^{N}$ and an articulate graph for assembly; (2) Part-level inference: $p^{\mathrm{ref}}_i$ is instantiated as the part mask $m_i$, then sample $(o'_i, a_i) \sim p(o', a \mid o, p^{\mathrm{ref}}_i)$; (3) Assembly: align …

**Figure 2.** A single Part World Model (PWM-ArtGen) block consists of a transformer block with observation, target articulated part, and diffusion-timestep conditioning via adaptive layer norm. It predicts the action and observation noises $\epsilon_a$ and $\epsilon_{o'}$, conditioned on the static observation $o$ and the reference part $p^{\mathrm{ref}}$, where $o$ is the input image and $p^{\mathrm{ref}}$ specifies the target articulated part. In accordance with the …

**Figure 3.** Construction pipeline of PartNet-Mobility-Reality (PM-R). Synthetic renderings from PartNet-Mobility are enhanced into photorealistic counterparts through the Qwen-Image-Edit-2509 [40] model guided by a structured Prompt Library, producing 19.7k photorealistic samples. To enhance the model's generalization ability on real-world images, we construct the PM-R dataset as illustrated in …

**Figure 4.** Qualitative comparison on the ACD dataset under zero-shot evaluation. From left to right: input image, ground truth (GT), our method, SINGAPO, and ArticulateAnything. Each row shows the predicted part layout and motion axes in both resting and articulated states. The last four rows highlight that other methods' predictions exhibit spatial misalignment with GT in part positioning and joint orientation. In …

**Figure 5.** Qualitative comparison of interactive dynamics generation.

**Figure 6.** Overview of the Part Mask Generator pipeline.

**Figure 7.** First-step prompt for GPT-4o: Abstract graph generation.

**Figure 8.** Second-step prompt for GPT-4o: Mask ID assignment.

**Figure 9.** Sensitivity analysis of the Co-T weight on the ACD dataset. [Adjacent text from Section B.2, Justification for Baseline Selection and Fair Comparison:] As shown in Tab. 6, the released checkpoint of Singapo trained on a much larger dataset of 3,063 objects significantly outperforms variants retrained on our limited PartNet-Mobility datasets. This performance gap confirms that Singapo, …

**Figure 10.** Qualitative comparison on the PartNet-Mobility dataset. From left to right: input image, ground truth (GT), our method, SINGAPO, and Articulate-Anything. Each row shows the predicted part layout and motion axes in both resting and articulated states. The last row highlights that SINGAPO's predictions exhibit spatial misalignment with GT in part positioning and joint orientation. In contrast, our method …

**Figure 11.** Additional qualitative results on the ACD test dataset and in-the-wild or internet-sourced images. First two rows: Successful reconstructions from ACD test inputs, demonstrating robust part-level articulation recovery under complex textures, occlusions, and lighting variations. Third row: Failure cases, typically occurring when input objects exhibit base geometries or articulation patterns outside the tra…
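The Figure 2 caption mentions conditioning through adaptive layer norm. As a rough illustration of that mechanism only, the Python sketch below follows the common DiT-style recipe in which a conditioning vector is mapped to a per-channel shift and scale applied after normalization. The functions `layer_norm`, `linear`, and `adaln_modulate`, and the way the conditioning vector is assembled, are assumptions for illustration and are not taken from the paper.

```python
import math

def layer_norm(x, eps=1e-6):
    """Normalize a token vector to zero mean and unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def linear(vec, weight, bias):
    """Dense layer: weight is a list of rows, one row per output unit."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weight, bias)]

def adaln_modulate(token, cond, weight, bias):
    """Return LN(token) * (1 + scale) + shift, with (shift, scale) from cond.

    cond stands in for an embedding of the observation, the reference part,
    and the two diffusion timesteps; weight has 2 * len(token) rows.
    """
    d = len(token)
    out = linear(cond, weight, bias)
    shift, scale = out[:d], out[d:]
    normed = layer_norm(token)
    return [n * (1.0 + s) + b for n, s, b in zip(normed, scale, shift)]

# With zero-initialized projection weights the block reduces to plain LayerNorm.
token = [1.0, 2.0, 3.0, 4.0]
cond = [0.5, -0.5, 0.3, 0.9]  # placeholder embedding, last two entries ~ t_a/T, t_o/T
zero_w = [[0.0] * len(cond) for _ in range(2 * len(token))]
zero_b = [0.0] * (2 * len(token))
identity_out = adaln_modulate(token, cond, zero_w, zero_b)
```

Feeding both timesteps through the same conditioning path is one plausible way for a single block to know how noisy each modality currently is.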

Abstract

The key challenge in articulated 3D object generation from a single image is accurately predicting the underlying kinematic structure. Existing methods either infer kinematic parameters directly from a static image that lacks dynamic part-level kinematic relationships, or estimate parameters from visual dynamics generated from a single image, which is prone to accumulated errors of two steps. Moreover, the limited scale and diversity of existing annotated datasets further hinder generalization to complex, real-world objects. To overcome these limitations, we propose to learn the joint distribution of visual dynamics and kinematic parameters. Recognizing that articulated objects can be formulated as dynamic systems, we propose a unified Part World Model called PWM-ArtGen. To leverage unannotated data, this model couples action diffusion and image diffusion with independent diffusion timesteps, which enables visual branch co-training. We further curate a photorealistic dataset of 19.7k part-level image pairs without kinematic annotations, to support co-training. Experiments demonstrate that PWM-ArtGen substantially outperforms existing baselines in the resting state and exhibits strong zero-shot generalization to out-of-distribution objects.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

PWM-ArtGen couples action and image diffusion branches with independent timesteps to train on unannotated pairs and avoid two-step error buildup, but the abstract gives no mechanism showing how that coupling actually recovers kinematics.

The main point is that this paper tries to learn the joint distribution of visual dynamics and kinematic parameters for articulated objects by running an action diffusion model and an image diffusion model together, each with its own timestep schedule. That setup lets them co-train on the new 19.7k unannotated part-level image pairs instead of needing kinematic labels. The claim is that this sidesteps both the static-image limitation and the accumulated errors of generate-then-estimate pipelines.

What stands out is the practical choice to work with unannotated data at scale and the framing of articulated objects as dynamic systems. Curating the photorealistic dataset is a concrete step that prior work often skipped.

The soft spot is exactly the one the stress-test flags: independent timesteps remove any direct training signal that would force the predicted actions to produce the observed image changes. Nothing in the abstract shows how the model is prevented from learning spurious correlations instead of real part kinematics, especially for the zero-shot OOD cases. Without equations, training details, or tables, it is impossible to tell whether the reported gains in the resting state come from the joint model or from dataset size or other factors.

This is for people already working on part-level 3D generation or robotics simulation who need better ways to handle unlabelled articulated data. The central idea is coherent on its own terms and engages the stated limitations of earlier methods, so it is worth sending to referees even if the current evidence is thin. They can check whether the coupling actually works or whether the independent-timestep design leaves the kinematic recovery under-constrained.

Referee Report

2 major / 0 minor

Summary. The paper proposes PWM-ArtGen, a Part World Model for articulated 3D object generation from a single image. It addresses limitations of prior methods by learning the joint distribution of visual dynamics and kinematic parameters via a unified model that couples action diffusion and image diffusion using independent timesteps, enabling co-training on a newly curated photorealistic dataset of 19.7k part-level image pairs without kinematic annotations. The central claims are that this yields substantially better performance than baselines in the resting state and strong zero-shot generalization to out-of-distribution objects.

Significance. If the joint-training approach on unannotated data reliably recovers kinematic structure, the work could meaningfully advance single-image articulated generation by sidestepping accumulated errors from two-step pipelines and scaling beyond small annotated datasets. The use of independent-timestep diffusion for co-training is a potentially useful technical device for leveraging larger unannotated corpora.

major comments (2)

[Abstract] The central claim that coupling action diffusion and image diffusion with independent timesteps on unannotated pairs recovers kinematic parameters more reliably than two-step methods is load-bearing for both the performance and zero-shot OOD generalization assertions, yet the description provides no mechanism ensuring that predicted actions are linked to observed image dynamics during co-training; without such linkage the model could fit spurious correlations.
[Abstract] The experiments are said to demonstrate substantial outperformance and strong zero-shot generalization, but no quantitative tables, ablation studies, dataset statistics (e.g., diversity metrics for the 19.7k pairs), or evaluation protocols are supplied, preventing verification that the claimed gains are supported by the data rather than by the unannotated training regime.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We respond point-by-point to the two major comments below, drawing on details from the full manuscript while noting where the abstract can be clarified.

Referee: [Abstract] The central claim that coupling action diffusion and image diffusion with independent timesteps on unannotated pairs recovers kinematic parameters more reliably than two-step methods is load-bearing for both the performance and zero-shot OOD generalization assertions, yet the description provides no mechanism ensuring that predicted actions are linked to observed image dynamics during co-training; without such linkage the model could fit spurious correlations.

Authors: The abstract is concise by design. Section 3 of the manuscript specifies the linkage: the action diffusion branch is conditioned on the source image and the target image from each training pair (with independent timesteps), and the training objective requires that the predicted action, when applied through the dynamics, reconstructs the observed target image via the image diffusion branch. This direct consistency constraint on paired data prevents fitting to spurious correlations. We will revise the abstract to include a short clause describing this conditioning.
revision: yes

Referee: [Abstract] The experiments are said to demonstrate substantial outperformance and strong zero-shot generalization, but no quantitative tables, ablation studies, dataset statistics (e.g., diversity metrics for the 19.7k pairs), or evaluation protocols are supplied, preventing verification that the claimed gains are supported by the data rather than by the unannotated training regime.

Authors: Abstracts summarize rather than tabulate. The full manuscript contains quantitative comparisons in Tables 1 and 2 (Section 4), ablations in Section 4.3, dataset statistics and diversity metrics for the 19.7k pairs in Section 3.2, and evaluation protocols in Section 4.1. These results support the performance and zero-shot claims. We can expand the abstract's results sentence or add a pointer if the referee prefers.
revision: partial

Circularity Check

0 steps flagged

No circularity; claims rest on data-driven learning from curated pairs

The paper frames its core contribution as learning the joint distribution of visual dynamics and kinematic parameters via coupled diffusion models trained on a 19.7k unannotated dataset. No equations, fitted parameters renamed as predictions, or self-citations appear in the provided text that would reduce any claimed result to an input by construction. The independent-timestep coupling and zero-shot generalization are presented as empirical outcomes of co-training rather than tautological definitions or self-referential fits. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that articulated objects behave as dynamic systems whose visual and kinematic distributions can be jointly modeled by diffusion; no free parameters or invented entities are named in the abstract.

axioms (1)

domain assumption: Articulated objects can be formulated as dynamic systems.
Explicitly invoked to justify the Part World Model construction.

Reference graph

Works this paper leans on

52 extracted references · 8 canonical work pages · 4 internal anchors

[1] Agarwal, N., Ali, A., Bala, M., Balaji, Y., Barker, E., Cai, T., Chattopadhyay, P., Chen, Y., Cui, Y., Ding, Y., et al.: Cosmos world foundation model platform for physical ai. arXiv preprint arXiv:2501.03575 (2025)
[2] Bardes, A., Garrido, Q., Ponce, J., Chen, X., Rabbat, M., LeCun, Y., Assran, M., Ballas, N.: Revisiting feature prediction for learning visual representations from video. arXiv preprint arXiv:2404.08471 (2024)
[3] Brooks, T., Peebles, B., Holmes, C., DePue, W., Guo, Y., Jing, L., Schnurr, D., Taylor, J., Luhman, T., Luhman, E., Ng, C., Wang, R., Ramesh, A.: Video generation models as world simulators (2024), https://openai.com/research/video-generation-models-as-world-simulators, accessed: 29 June 2026
[4] Bruce, J., Dennis, M., Edwards, A., Parker-Holder, J., Shi, Y.J., Hughes, E., Lai, M., Mavalankar, A., Steigerwald, R., Apps, C., Aytar, Y., Bechtle, S., Behbahani, F., Chan, S., Heess, N., Gonzalez, L., Osindero, S., Ozair, S., Reed, S., Zhang, J., Zolna, K., Clune, J., De Freitas, N., Singh, S., Rocktäschel, T.: Genie: generative interactive environment...
[5] Chen, C., Liu, I., Wei, X., Su, H., Liu, M.: Freeart3d: Training-free articulated object generation using 3d diffusion. In: Proceedings of the SIGGRAPH Asia 2025 Conference Papers (2025)
[6] Chen, H., Lan, Y., Chen, Y., Pan, X.: Artilatent: Realistic articulated 3d object generation via structured latents. In: Proceedings of the SIGGRAPH Asia 2025 Conference Papers (2025)
[7] Chen, Z., Walsman, A., Memmel, M., Mo, K., Fang, A., Vemuri, K., Wu, A., Fox, D., Gupta, A.: Urdformer: A pipeline for constructing articulated simulation environments from real-world images. In: Proceedings of Robotics: Science and Systems (2024)
[8] Collins, J., Goel, S., Deng, K., Luthra, A., Xu, L., Gundogdu, E., Zhang, X., Vicente, T.F.Y., Dideriksen, T., Arora, H., Guillaumin, M., Malik, J.: Abo: Dataset and benchmarks for real-world 3d object understanding. In: IEEE Conf. Comput. Vis. Pattern Recog. pp. 21094–21104 (2022)
[9] Geng, H., Xu, H., Zhao, C., Xu, C., Yi, L., Huang, S., Wang, H.: Gapartnet: Cross-category domain-generalizable object perception and manipulation via generalizable and actionable parts. In: IEEE Conf. Comput. Vis. Pattern Recog. pp. 7081–7091 (2023)
[10] Guo, Y., Hu, Y., Zhang, J., Wang, Y.J., Chen, X., Lu, C., Chen, J.: Prediction with action: Visual policy learning via joint denoising process. In: Adv. Neural Inform. Process. Syst. vol. 37, pp. 112386–112410 (2024)
[11] Heppert, N., Irshad, M.Z., Zakharov, S., Liu, K., Ambrus, R.A., Bohg, J., Valada, A., Kollar, T.: Carto: Category and joint agnostic reconstruction of articulated objects. In: IEEE Conf. Comput. Vis. Pattern Recog. pp. 21201–21210 (2023)
[12] Ho, J., Jain, A., Abbeel, P.: Denoising diffusion probabilistic models. In: Adv. Neural Inform. Process. Syst. vol. 33, pp. 6840–6851 (2020)
[13] Hu, R., Li, W., Van Kaick, O., Shamir, A., Zhang, H., Huang, H.: Learning to predict part mobility from a single static snapshot. ACM Trans. Graph. 36(6), 1–13 (2017)
[14] Hurst, A., Lerer, A., Goucher, A.P., Perelman, A., Ramesh, A., Clark, A., Ostrow, A., Welihinda, A., Hayes, A., Radford, A., et al.: Gpt-4o system card. arXiv preprint arXiv:2410.21276 (2024)
[15] Iliash, D., Jiang, H., Zhang, Y., Savva, M., Chang, A.X.: S2o: Static to openable enhancement for articulated 3d objects. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. pp. 6785–6795 (2026)
[16] Jiang, Z., Hsu, C.C., Zhu, Y.: Ditto: Building digital twins of articulated objects from interaction. In: IEEE Conf. Comput. Vis. Pattern Recog. pp. 5616–5626 (2022)
[17] Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.Y., Dollár, P., Girshick, R.: Segment anything. In: Int. Conf. Comput. Vis. (2023)
[18] Le, L., Xie, J., Liang, W., Wang, H.J., Yang, Y., Ma, Y.J., Vedder, K., Krishna, A., Jayaraman, D., Eaton, E.: Articulate-anything: Automatic modeling of articulated objects via a vision-language foundation model. In: Int. Conf. Learn. Represent. (2025)
[19] Lei, J., Deng, C., Shen, W.B., Guibas, L., Daniilidis, K.: Nap: Neural 3d articulated object prior. In: Adv. Neural Inform. Process. Syst. vol. 36, pp. 31878–31894 (2023)
[20] Li, R., Zheng, C., Rupprecht, C., Vedaldi, A.: Dragapart: Learning a part-level motion prior for articulated objects. In: Eur. Conf. Comput. Vis. pp. 165–183. Springer (2024)
[21] Li, R., Zheng, C., Rupprecht, C., Vedaldi, A.: Puppet-master: Scaling interactive video generation as a motion prior for part-level dynamics. In: Int. Conf. Comput. Vis. pp. 13405–13415 (2025)
[22] Liu, J., Iliash, D., Chang, A., Savva, M., Mahdavi Amiri, A.: Singapo: Single image controlled generation of articulated parts in objects. In: Int. Conf. Learn. Represent. (2025)
[23] Liu, J., Mahdavi-Amiri, A., Savva, M.: Paris: Part-level reconstruction and motion analysis for articulated objects. In: Int. Conf. Comput. Vis. pp. 352–363 (2023)
[24] Liu, J., Tam, H.I.I., Mahdavi-Amiri, A., Savva, M.: Cage: Controllable articulation generation. In: IEEE Conf. Comput. Vis. Pattern Recog. pp. 17880–17889 (2024)
[25] Liu, S., Gupta, S., Wang, S.: Building rearticulable models for arbitrary 3d objects from 4d point clouds. In: IEEE Conf. Comput. Vis. Pattern Recog. pp. 21138–21147 (2023)
[26] Liu, Y., Jia, B., Lu, R., Gan, C., Chen, H., Ni, J., Zhu, S.C., Huang, S.: Videoartgs: Building digital twins of articulated objects from monocular video. arXiv preprint arXiv:2509.17647 (2025)
[27] Liu, Y., Jia, B., Lu, R., Ni, J., Zhu, S.C., Huang, S.: Building interactable replicas of complex articulated objects via gaussian splatting. In: Int. Conf. Learn. Represent. (2025)
[28] Lu, R., Liu, Y., Tang, J., Ni, J., Wang, Y., Wan, D., Zeng, G., Chen, Y., Huang, S.: Dreamart: Generating interactable articulated objects from a single image. arXiv preprint arXiv:2507.05763 (2025)
[29] Mo, K., Zhu, S., Chang, A.X., Yi, L., Tripathi, S., Guibas, L.J., Su, H.: Partnet: A large-scale benchmark for fine-grained and hierarchical part-level 3d object understanding. In: IEEE Conf. Comput. Vis. Pattern Recog. pp. 909–918 (2019)
[30] Mu, J., Qiu, W., Kortylewski, A., Yuille, A., Vasconcelos, N., Wang, X.: A-sdf: Learning disentangled signed distance functions for articulated shape representation. In: Int. Conf. Comput. Vis. pp. 13001–13011 (2021)
[31] Oquab, M., Darcet, T., Moutakanni, T., Vo, H., Szafraniec, M., Khalidov, V., Fernandez, P., Haziza, D., Massa, F., El-Nouby, A., et al.: Dinov2: Learning robust visual features without supervision. Trans. Mach. Learn Res. (2024)
[32] Peng, W., Lv, J., Lu, C., Savva, M.: itaco: Interactable digital twins of articulated objects from casually captured rgbd videos. In: International Conference on 3D Vision (3DV). pp. 520–531 (2026)
[33] Shen, B., Xia, F., Li, C., Martín-Martín, R., Fan, L., Wang, G., Pérez-D’Arpino, C., Buch, S., Srivastava, S., Tchapmi, L., et al.: igibson 1.0: A simulation environment for interactive tasks in large realistic scenes. In: IEEE/RSJ International Conference on Intelligent Robots and Systems. pp. 7520–7527. IEEE (2021)
[34] Song, C., Wei, J., Foo, C.S., Lin, G., Liu, F.: Reacto: Reconstructing articulated objects from a single video. In: IEEE Conf. Comput. Vis. Pattern Recog. pp. 5384–5395 (2024)
[35] Song, J., Meng, C., Ermon, S.: Denoising diffusion implicit models. In: Int. Conf. Learn. Represent. (2021)
[36] Su, J., Feng, Y., Li, Z., Song, J., He, Y., Ren, B., Xu, B.: Artformer: Controllable generation of diverse 3d articulated objects. In: IEEE Conf. Comput. Vis. Pattern Recog. pp. 1894–1904 (2025)
[37] Tseng, W.C., Liao, H.J., Yen-Chen, L., Sun, M.: Cla-nerf: Category-level articulated neural radiance field. In: International Conference on Robotics and Automation. pp. 8454–8460 (2022)
[38] Wang, X., Zhou, B., Shi, Y., Chen, X., Zhao, Q., Xu, K.: Shape2motion: Joint analysis of motion parts and attributes from 3d shapes. In: IEEE Conf. Comput. Vis. Pattern Recog. pp. 8876–8884 (2019)
[39] Weng, Y., Wen, B., Tremblay, J., Blukis, V., Fox, D., Guibas, L., Birchfield, S.: Neural implicit representation for building digital twins of unknown articulated objects. In: IEEE Conf. Comput. Vis. Pattern Recog. pp. 3141–3150 (2024)
[40] Wu, C., Li, J., Zhou, J., Lin, J., Gao, K., Yan, K., Yin, S.m., Bai, S., Xu, X., Chen, Y., et al.: Qwen-image technical report. arXiv preprint arXiv:2508.02324 (2025)
[41] Wu, H., Jing, Y., Cheang, C., Chen, G., Xu, J., Li, X., Liu, M., Li, H., Kong, T.: Unleashing large-scale video generative pre-training for visual robot manipulation. In: Int. Conf. Learn. Represent. (2024)
[42] Wu, R., Wang, X., Liu, L., Guo, C.L., Qiu, J., Li, C., Huang, L., Su, Z., Cheng, M.M.: Dipo: Dual-state images controlled articulated object generation powered by diverse data. In: Adv. Neural Inform. Process. Syst. vol. 38, pp. 108665–108689 (2025)
[43] Wu, Y., Lin, Y., Lao, W., Lin, Y., Wei, Y.L., Zheng, W.S., Wu, A.: Dexgrasp-zero: A morphology-aligned policy for zero-shot cross-embodiment dexterous grasping. arXiv preprint arXiv:2603.16806 (2026)
[44] Xiang, F., Qin, Y., Mo, K., Xia, Y., Zhu, H., Liu, F., Liu, M., Jiang, H., Yuan, Y., Wang, H., Yi, L., Chang, A.X., Guibas, L.J., Su, H.: Sapien: A simulated part-based interactive environment. In: IEEE Conf. Comput. Vis. Pattern Recog. pp. 11094–11104 (2020)
[45] Yan, Z., Hu, R., Yan, X., Chen, L., Van Kaick, O., Zhang, H., Huang, H.: Rpm-net: recurrent prediction of motion and parts from point cloud. ACM Trans. Graph. 38(6), 1–15 (2019)
[46] Yu, S., Kwak, S., Jang, H., Jeong, J., Huang, J., Shin, J., Xie, S.: Representation alignment for generation: Training diffusion transformers is easier than you think. In: Int. Conf. Learn. Represent. (2025)
[47] Zhu, C., Yu, R., Feng, S., Burchfiel, B., Shah, P., Gupta, A.: Unified world models: Coupling video and action diffusion for pretraining on large robotic datasets. In: Proceedings of Robotics: Science and Systems (2025)

[1] [1]

Cosmos World Foundation Model Platform for Physical AI

Agarwal, N., Ali, A., Bala, M., Balaji, Y., Barker, E., Cai, T., Chattopadhyay, P., Chen, Y., Cui, Y., Ding, Y., et al.: Cosmos world foundation model platform for physical ai. arXiv preprint arXiv:2501.03575 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025

[2] [2]

Revisiting Feature Prediction for Learning Visual Representations from Video

Bardes, A., Garrido, Q., Ponce, J., Chen, X., Rabbat, M., LeCun, Y., Assran, M., Ballas, N.: Revisiting feature prediction for learning visual representations from video. arXiv preprint arXiv:2404.08471 (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024

[3] [3]

Brooks, T., Peebles, B., Holmes, C., DePue, W., Guo, Y., Jing, L., Schnurr, D., Taylor, J., Luhman, T., Luhman, E., Ng, C., Wang, R., Ramesh, A.: Video gener- ation models as world simulators (2024),https://openai.com/research/video- generation-models-as-world-simulators, accessed: 29 June 2026

2024

[4] [4]

Bruce, J., Dennis, M., Edwards, A., Parker-Holder, J., Shi, Y.J., Hughes, E., Lai, M., Mavalankar, A., Steigerwald, R., Apps, C., Aytar, Y., Bechtle, S., Behbahani, F., Chan, S., Heess, N., Gonzalez, L., Osindero, S., Ozair, S., Reed, S., Zhang, J., Zolna, K., Clune, J., De Freitas, N., Singh, S., Rocktäschel, T.: Genie: generative interactive environment...

2024

[5] [5]

In: Proceedings of the SIGGRAPH Asia 2025 Conference Papers (2025)

Chen, C., Liu, I., Wei, X., Su, H., Liu, M.: Freeart3d: Training-free articulated object generation using 3d diffusion. In: Proceedings of the SIGGRAPH Asia 2025 Conference Papers (2025)

2025

[6] [6]

In: Proceedings of the SIGGRAPH Asia 2025 Conference Papers (2025)

Chen, H., Lan, Y., Chen, Y., Pan, X.: Artilatent: Realistic articulated 3d object generation via structured latents. In: Proceedings of the SIGGRAPH Asia 2025 Conference Papers (2025)

2025

[7] [7]

In: Proceedings of Robotics: Science and Systems (2024)

Chen, Z., Walsman, A., Memmel, M., Mo, K., Fang, A., Vemuri, K., Wu, A., Fox, D., Gupta, A.: Urdformer: A pipeline for constructing articulated simulation environments from real-world images. In: Proceedings of Robotics: Science and Systems (2024)

2024

[8] Collins, J., Goel, S., Deng, K., Luthra, A., Xu, L., Gundogdu, E., Zhang, X., Vicente, T.F.Y., Dideriksen, T., Arora, H., Guillaumin, M., Malik, J.: Abo: Dataset and benchmarks for real-world 3d object understanding. In: IEEE Conf. Comput. Vis. Pattern Recog. pp. 21094–21104 (2022)

[9] Geng, H., Xu, H., Zhao, C., Xu, C., Yi, L., Huang, S., Wang, H.: Gapartnet: Cross-category domain-generalizable object perception and manipulation via generalizable and actionable parts. In: IEEE Conf. Comput. Vis. Pattern Recog. pp. 7081–7091 (2023)

[10] Guo, Y., Hu, Y., Zhang, J., Wang, Y.J., Chen, X., Lu, C., Chen, J.: Prediction with action: Visual policy learning via joint denoising process. In: Adv. Neural Inform. Process. Syst. vol. 37, pp. 112386–112410 (2024)

[11] Heppert, N., Irshad, M.Z., Zakharov, S., Liu, K., Ambrus, R.A., Bohg, J., Valada, A., Kollar, T.: Carto: Category and joint agnostic reconstruction of articulated objects. In: IEEE Conf. Comput. Vis. Pattern Recog. pp. 21201–21210 (2023)

[12] Ho, J., Jain, A., Abbeel, P.: Denoising diffusion probabilistic models. In: Adv. Neural Inform. Process. Syst. vol. 33, pp. 6840–6851 (2020)

[13] Hu, R., Li, W., Van Kaick, O., Shamir, A., Zhang, H., Huang, H.: Learning to predict part mobility from a single static snapshot. ACM Trans. Graph. 36(6), 1–13 (2017)

[14] Hurst, A., Lerer, A., Goucher, A.P., Perelman, A., Ramesh, A., Clark, A., Ostrow, A., Welihinda, A., Hayes, A., Radford, A., et al.: Gpt-4o system card. arXiv preprint arXiv:2410.21276 (2024)

[15] Iliash, D., Jiang, H., Zhang, Y., Savva, M., Chang, A.X.: S2o: Static to openable enhancement for articulated 3d objects. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. pp. 6785–6795 (2026)

[16] Jiang, Z., Hsu, C.C., Zhu, Y.: Ditto: Building digital twins of articulated objects from interaction. In: IEEE Conf. Comput. Vis. Pattern Recog. pp. 5616–5626 (2022)

[17] Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.Y., Dollár, P., Girshick, R.: Segment anything. In: Int. Conf. Comput. Vis. (2023)

[18] Le, L., Xie, J., Liang, W., Wang, H.J., Yang, Y., Ma, Y.J., Vedder, K., Krishna, A., Jayaraman, D., Eaton, E.: Articulate-anything: Automatic modeling of articulated objects via a vision-language foundation model. In: Int. Conf. Learn. Represent. (2025)

[19] Lei, J., Deng, C., Shen, W.B., Guibas, L., Daniilidis, K.: Nap: Neural 3d articulated object prior. In: Adv. Neural Inform. Process. Syst. vol. 36, pp. 31878–31894 (2023)

[20] Li, R., Zheng, C., Rupprecht, C., Vedaldi, A.: Dragapart: Learning a part-level motion prior for articulated objects. In: Eur. Conf. Comput. Vis. pp. 165–183. Springer (2024)

[21] Li, R., Zheng, C., Rupprecht, C., Vedaldi, A.: Puppet-master: Scaling interactive video generation as a motion prior for part-level dynamics. In: Int. Conf. Comput. Vis. pp. 13405–13415 (2025)

[22] Liu, J., Iliash, D., Chang, A., Savva, M., Mahdavi Amiri, A.: Singapo: Single image controlled generation of articulated parts in objects. In: Int. Conf. Learn. Represent. (2025)

[23] Liu, J., Mahdavi-Amiri, A., Savva, M.: Paris: Part-level reconstruction and motion analysis for articulated objects. In: Int. Conf. Comput. Vis. pp. 352–363 (2023)

[24] Liu, J., Tam, H.I.I., Mahdavi-Amiri, A., Savva, M.: Cage: Controllable articulation generation. In: IEEE Conf. Comput. Vis. Pattern Recog. pp. 17880–17889 (2024)

[25] Liu, S., Gupta, S., Wang, S.: Building rearticulable models for arbitrary 3d objects from 4d point clouds. In: IEEE Conf. Comput. Vis. Pattern Recog. pp. 21138–21147 (2023)

[26] Liu, Y., Jia, B., Lu, R., Gan, C., Chen, H., Ni, J., Zhu, S.C., Huang, S.: Videoartgs: Building digital twins of articulated objects from monocular video. arXiv preprint arXiv:2509.17647 (2025)

[27] Liu, Y., Jia, B., Lu, R., Ni, J., Zhu, S.C., Huang, S.: Building interactable replicas of complex articulated objects via gaussian splatting. In: Int. Conf. Learn. Represent. (2025)

[28] Lu, R., Liu, Y., Tang, J., Ni, J., Wang, Y., Wan, D., Zeng, G., Chen, Y., Huang, S.: Dreamart: Generating interactable articulated objects from a single image. arXiv preprint arXiv:2507.05763 (2025)

[29] Mo, K., Zhu, S., Chang, A.X., Yi, L., Tripathi, S., Guibas, L.J., Su, H.: Partnet: A large-scale benchmark for fine-grained and hierarchical part-level 3d object understanding. In: IEEE Conf. Comput. Vis. Pattern Recog. pp. 909–918 (2019)

[30] Mu, J., Qiu, W., Kortylewski, A., Yuille, A., Vasconcelos, N., Wang, X.: A-sdf: Learning disentangled signed distance functions for articulated shape representation. In: Int. Conf. Comput. Vis. pp. 13001–13011 (2021)

[31] Oquab, M., Darcet, T., Moutakanni, T., Vo, H., Szafraniec, M., Khalidov, V., Fernandez, P., Haziza, D., Massa, F., El-Nouby, A., et al.: Dinov2: Learning robust visual features without supervision. Trans. Mach. Learn. Res. (2024)

[32] Peng, W., Lv, J., Lu, C., Savva, M.: itaco: Interactable digital twins of articulated objects from casually captured rgbd videos. In: International Conference on 3D Vision (3DV). pp. 520–531 (2026)

[33] Shen, B., Xia, F., Li, C., Martín-Martín, R., Fan, L., Wang, G., Pérez-D’Arpino, C., Buch, S., Srivastava, S., Tchapmi, L., et al.: igibson 1.0: A simulation environment for interactive tasks in large realistic scenes. In: IEEE/RSJ International Conference on Intelligent Robots and Systems. pp. 7520–7527. IEEE (2021)

[34] Song, C., Wei, J., Foo, C.S., Lin, G., Liu, F.: Reacto: Reconstructing articulated objects from a single video. In: IEEE Conf. Comput. Vis. Pattern Recog. pp. 5384–5395 (2024)

[35] Song, J., Meng, C., Ermon, S.: Denoising diffusion implicit models. In: Int. Conf. Learn. Represent. (2021)

[36] Su, J., Feng, Y., Li, Z., Song, J., He, Y., Ren, B., Xu, B.: Artformer: Controllable generation of diverse 3d articulated objects. In: IEEE Conf. Comput. Vis. Pattern Recog. pp. 1894–1904 (2025)

[37] Tseng, W.C., Liao, H.J., Yen-Chen, L., Sun, M.: Cla-nerf: Category-level articulated neural radiance field. In: International Conference on Robotics and Automation. pp. 8454–8460 (2022)

[38] Wang, X., Zhou, B., Shi, Y., Chen, X., Zhao, Q., Xu, K.: Shape2motion: Joint analysis of motion parts and attributes from 3d shapes. In: IEEE Conf. Comput. Vis. Pattern Recog. pp. 8876–8884 (2019)

[39] Weng, Y., Wen, B., Tremblay, J., Blukis, V., Fox, D., Guibas, L., Birchfield, S.: Neural implicit representation for building digital twins of unknown articulated objects. In: IEEE Conf. Comput. Vis. Pattern Recog. pp. 3141–3150 (2024)

[40] Wu, C., Li, J., Zhou, J., Lin, J., Gao, K., Yan, K., Yin, S.m., Bai, S., Xu, X., Chen, Y., et al.: Qwen-image technical report. arXiv preprint arXiv:2508.02324 (2025)

[41] Wu, H., Jing, Y., Cheang, C., Chen, G., Xu, J., Li, X., Liu, M., Li, H., Kong, T.: Unleashing large-scale video generative pre-training for visual robot manipulation. In: Int. Conf. Learn. Represent. (2024)

[42] Wu, R., Wang, X., Liu, L., Guo, C.L., Qiu, J., Li, C., Huang, L., Su, Z., Cheng, M.M.: Dipo: Dual-state images controlled articulated object generation powered by diverse data. In: Adv. Neural Inform. Process. Syst. vol. 38, pp. 108665–108689 (2025)

[43] Wu, Y., Lin, Y., Lao, W., Lin, Y., Wei, Y.L., Zheng, W.S., Wu, A.: Dexgrasp-zero: A morphology-aligned policy for zero-shot cross-embodiment dexterous grasping. arXiv preprint arXiv:2603.16806 (2026)

[44] Xiang, F., Qin, Y., Mo, K., Xia, Y., Zhu, H., Liu, F., Liu, M., Jiang, H., Yuan, Y., Wang, H., Yi, L., Chang, A.X., Guibas, L.J., Su, H.: Sapien: A simulated part-based interactive environment. In: IEEE Conf. Comput. Vis. Pattern Recog. pp. 11094–11104 (2020)

[45] Yan, Z., Hu, R., Yan, X., Chen, L., Van Kaick, O., Zhang, H., Huang, H.: Rpm-net: recurrent prediction of motion and parts from point cloud. ACM Trans. Graph. 38(6), 1–15 (2019)

[46] Yu, S., Kwak, S., Jang, H., Jeong, J., Huang, J., Shin, J., Xie, S.: Representation alignment for generation: Training diffusion transformers is easier than you think. In: Int. Conf. Learn. Represent. (2025)

[47] Zhu, C., Yu, R., Feng, S., Burchfiel, B., Shah, P., Gupta, A.: Unified world models: Coupling video and action diffusion for pretraining on large robotic datasets. In: Proceedings of Robotics: Science and Systems (2025)

Our supplementary materials provide comprehensive implementation detail...

[48] A segmentation image with region IDs

[49] A part connectivity graph (containing only: 'base', 'door', 'drawer') where:
- There is exactly one base.
- All doors and drawers attach directly to the base.
- The child order in the graph already reflects spatial ordering and must be preserved.

An example of part connectivity graph: I recognize all the articulated parts in a storage furniture, they are:...

[50] Use only region IDs that appear in the segmentation result. Do NOT create or renumber IDs.

[51] Each region ID belongs to exactly one part instance (no overlaps)

[52]

If a single part spans multiple region IDs, include all of them in that part’s `ids` array. Here is an example of your response: I recognize all the articulated parts of a storage furniture, I recognize all the articulated parts in a storage furniture, they are: base[<ids>], door[<ids>] (attach to base), drawer[<ids>] (attach to base). ```json {"base": {"...
