pith. machine review for the scientific record.

arxiv: 2605.08054 · v1 · submitted 2026-05-08 · 💻 cs.CV

Recognition: 2 theorem links


Towards Highly-Constrained Human Motion Generation with Retrieval-Guided Diffusion Noise Optimization

Fang-Lue Zhang, Hanchao Liu, Shi-Min Hu, Shining Zhang, Tai-Jiang Mu

Pith reviewed 2026-05-11 02:13 UTC · model grok-4.3

classification 💻 cs.CV
keywords human motion generation · diffusion models · retrieval guidance · constrained generation · zero-shot tasks · noise optimization · LLM parsing

The pith

Retrieval-guided noise initialization enables diffusion models to satisfy severe spatiotemporal constraints in human motion generation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to show that searching large motion datasets for reference examples and using them to shape the starting noise in diffusion optimization can overcome the limitations of current methods on very hard custom constraints. This matters because it would let motion generators handle tasks like avoiding specific obstacles or walking an exact number of steps without retraining the underlying model. The approach parses the task with an LLM to identify the toughest constraints, then builds a blended noise mask that mixes random noise with noise derived from the retrieved motions, weighted by how well those motions score under the task's reward functions. If this works, it extends the reach of training-free diffusion techniques to more realistic and complex animation scenarios.
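
Read as pseudocode, that pipeline might look roughly as follows. This is an illustrative sketch under Pith's reading, not the authors' implementation: every name here (`parse_task`, `retrieve_reference`, the toy reward) is invented, and the real method operates on diffusion noise tensors rather than these toy lists.

```python
# Illustrative sketch of the described pipeline (names invented, not the paper's API).

def parse_task(constraints, looks_difficult):
    """Stand-in for LLM-based relational task parsing: split a task's
    constraints into a difficult group (served by retrieval) and the rest."""
    hard = [c for c in constraints if looks_difficult(c)]
    easy = [c for c in constraints if not looks_difficult(c)]
    return hard, easy

def retrieve_reference(dataset, reward):
    """Return the dataset motion scoring highest under the task reward."""
    return max(dataset, key=reward)

# Toy task: one hard constraint, one easy one; motions are short 1-D tracks.
constraints = [("avoid_low_barrier", "hard"), ("walk_forward", "easy")]
hard, easy = parse_task(constraints, looks_difficult=lambda c: c[1] == "hard")

dataset = [[0.0, 0.2], [0.0, 0.9], [0.0, 0.4]]
ref = retrieve_reference(dataset, reward=lambda m: -abs(m[-1] - 1.0))
```

The retrieved `ref` would then seed the noise initialization that the optimization step refines.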

Core claim

By introducing relational task parsing to identify difficult constraints, and a reward-guided mask that combines retrieved reference noise with random noise into a better initialization, the method can optimize diffusion noise from that starting point to generate human motions that meet highly challenging zero-shot goal functions, such as those involving severe spatial obstacles or precise step counts.

What carries the argument

The reward-guided mask that blends random diffusion noise with noise from retrieved reference motions to create an improved initialization for the training-free diffusion noise optimization process.
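
A minimal sketch of what such a blend could look like, assuming a binary per-frame mask (the paper's mask may well be soft and reward-weighted; `build_init_noise` and `hard_frames` are invented names):

```python
import random

def build_init_noise(num_frames, retrieved_noise, hard_frames, seed=0):
    """Frames governed by the difficult constraints take the retrieved
    reference noise; all other frames keep a fresh random draw."""
    rng = random.Random(seed)
    hard = set(hard_frames)
    return [retrieved_noise[t] if t in hard else rng.gauss(0.0, 1.0)
            for t in range(num_frames)]

# Stand-in for noise inverted from a retrieved motion: a constant track.
ref_noise = [0.5] * 10
init = build_init_noise(10, ref_noise, hard_frames=range(3, 7))
```

Optimization then starts from `init` instead of a purely random draw, so the difficult frames begin near a motion known to satisfy the hard constraints.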

If this is right

  • It enables solving tasks with severe spatial obstacles or specified numbers of walking steps that defeat prior methods.
  • LLM-based relational parsing allows automatic reasoning about what references to retrieve for a given task.
  • The training-free scheme keeps the method applicable without additional model training.
  • Applications in controllable character animation and virtual agent behavior synthesis become more feasible.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This approach might connect to retrieval-augmented generation techniques used in other AI domains like language or image synthesis.
  • Testing the method on constraints requiring motions not well-represented in existing datasets could reveal its boundaries.
  • Extending the relational parsing to handle multi-agent or interactive scenarios could be a natural next step.

Load-bearing premise

Suitable reference motions for the difficult constraints exist in the available datasets and can be identified and combined effectively through LLM parsing and reward-guided masking.

What would settle it

A counterexample would be a set of highly constrained tasks where the method, even with the retrieval-guided initialization, produces motions that violate the specified spatiotemporal constraints at a similar rate to standard diffusion noise optimization without retrieval.
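
Such a test could be scored as a simple violation-rate comparison; a hypothetical harness (`violation_rate` and the toy step-count constraint are illustrative, not the paper's evaluation code):

```python
def violation_rate(motions, constraints, tol=1e-3):
    """Fraction of motions violating at least one constraint; each constraint
    returns a nonnegative error, and errors <= tol count as satisfied."""
    bad = sum(1 for m in motions if any(c(m) > tol for c in constraints))
    return bad / len(motions)

# Toy example: an exact-step-count constraint on fabricated outputs.
constraints = [lambda m: abs(m["steps"] - 3)]
baseline_outputs = [{"steps": s} for s in (2, 3, 5, 3)]
retrieval_outputs = [{"steps": s} for s in (3, 3, 3, 2)]
```

If the retrieval-guided rate were statistically indistinguishable from the baseline rate across a task suite, the core claim would fail.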

Figures

Figures reproduced from arXiv: 2605.08054 by Fang-Lue Zhang, Hanchao Liu, Shi-Min Hu, Shining Zhang, Tai-Jiang Mu.

Figure 1. Training-free Human Motion Generation for Highly-constrained Generation Tasks. Compared to existing diffusion noise optimization methods [17, 24], which exhibit high constraint error and motion artifacts, our proposed Retrieval-Guided Diffusion Noise Optimization significantly improves performance on these tasks. This improvement is achieved by leveraging relevant skills retrieved from existing motion data…
Figure 2. Overview of Retrieval-Guided Diffusion Noise Optimization. Given a motion generation task represented by a combined constraint function FC, we apply either manual or LLM-based relational task parsing to group difficult constraints CR for retrieving potential skills xR, and to group the remaining constraints into subsets C1 and C2 that can be handled respectively by random and retrieved noises. Using a mot…
Figure 3. Qualitative examples for various highly-constrained generation tasks. The relational task parsing results are obtained via LLM.
Figure 4. Qualitative comparison. (a) ProgMoGen+DNO pro…
Figure 5. Performance on different levels of task difficulty. (a) …
Figure 6. Qualitative examples for Task: hand reaches a very high position with different types of constraints. (a) Spatial constraint: the target is located at z = 2.5 along the walking path. (b) Numerical constraint: the target is located at the origin and the goal is to reach the target position three times.
Figure 7. Comparison with MaskControl for Task-2 in the joint…
Figure 8. Failure cases. Some generated motions of our method…
Figure 9. Automatic LLM-based relational task parsing. The above shows the instruction comprising a task description, reasoning rules, …
read the original abstract

Generating human motion that satisfies customized zero-shot goal functions, enabling applications such as controllable character animation and behavior synthesis for virtual agents, is a critical capability. While current approaches handle many unseen constraints, they fail on tasks with very challenging spatiotemporal restrictions, such as severe spatial obstacles or specified numbers of walking steps. To equip motion generators for these highly constrained tasks, we present a retrieval-guided method built on the training-free diffusion noise optimization framework. The key idea is to search within large motion datasets for guidance that can potentially satisfy difficult constraints. We introduce relational task parsing to group target constraints and identify the difficult ones to be handled by retrieved reference. A better initialization for diffusion noise is then obtained via a reward-guided mask that combines random noise with retrieved noise. By optimizing diffusion noise from this improved initialization, we successfully solve highly constrained generation tasks. By leveraging LLM for relational task parsing, the whole framework is further enabled to automatically reason for what to retrieve, improving the intelligence of moving agents under a training-free optimization scheme.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper proposes a retrieval-guided method for highly-constrained zero-shot human motion generation built on training-free diffusion noise optimization. It introduces LLM-based relational task parsing to group constraints and flag difficult sub-tasks, retrieves reference motions from large datasets, and constructs an improved diffusion-noise initialization via a reward-guided mask that blends retrieved noise with random noise. Optimizing from this initialization is claimed to solve spatiotemporal tasks (e.g., severe obstacles or exact step counts) that standard diffusion generators cannot handle.

Significance. If the retrieval and masking steps reliably produce initializations that allow noise optimization to succeed on tasks where plain diffusion fails, the work would provide a practical, training-free route to more controllable motion synthesis for animation and virtual agents. The combination of dataset retrieval with LLM reasoning for constraint decomposition is a plausible way to inject external knowledge without retraining.

major comments (2)
  1. [Abstract] Abstract: the assertion that the method 'successfully solve[s] highly constrained generation tasks' is stated without any quantitative results, success rates, baseline comparisons, ablation studies, or metrics for constraint satisfaction. This is load-bearing for the central claim that the improved initialization outperforms standard diffusion on difficult spatiotemporal constraints.
  2. [Method] Method (relational task parsing and retrieval pipeline): the advantage of the reward-guided mask rests on the unverified assumptions that (i) sufficiently close reference motions exist in the dataset for arbitrary novel constraint sets and (ii) the LLM parser correctly identifies and groups difficult sub-tasks. No retrieval-precision statistics, coverage analysis, or failure-case handling are described, leaving open the possibility that the initialization offers no benefit over the baseline the paper says cannot solve these tasks.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our work. We address each major comment point by point below, providing clarifications and indicating revisions made to the manuscript.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the assertion that the method 'successfully solve[s] highly constrained generation tasks' is stated without any quantitative results, success rates, baseline comparisons, ablation studies, or metrics for constraint satisfaction. This is load-bearing for the central claim that the improved initialization outperforms standard diffusion on difficult spatiotemporal constraints.

    Authors: We agree that the abstract would benefit from explicitly referencing the supporting quantitative evidence. The full manuscript includes experiments (Section 4) with success rates on spatiotemporal constraint satisfaction, direct comparisons to standard diffusion noise optimization baselines, and ablations isolating the retrieval and reward-guided mask components. To address the concern, we have revised the abstract to include a concise summary of these key empirical results supporting the central claim. revision: yes

  2. Referee: [Method] Method (relational task parsing and retrieval pipeline): the advantage of the reward-guided mask rests on the unverified assumptions that (i) sufficiently close reference motions exist in the dataset for arbitrary novel constraint sets and (ii) the LLM parser correctly identifies and groups difficult sub-tasks. No retrieval-precision statistics, coverage analysis, or failure-case handling are described, leaving open the possibility that the initialization offers no benefit over the baseline the paper says cannot solve these tasks.

    Authors: The referee correctly notes that the method's effectiveness depends on dataset coverage and LLM parsing reliability. The manuscript uses established large-scale motion datasets (e.g., those containing thousands of diverse sequences) and LLM-based relational parsing to identify and group difficult sub-tasks, with the reward-guided mask designed to blend relevant retrieved noise. We acknowledge the absence of explicit retrieval-precision statistics or coverage analysis. We have made a partial revision by adding a discussion subsection on dataset coverage assumptions, LLM parsing examples, and qualitative failure-case handling; a full quantitative retrieval analysis would require new experiments and is noted as a direction for future work. revision: partial
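
If the authors do add the promised quantitative retrieval analysis, the standard metric is straightforward to compute; a sketch (`precision_at_k` is a generic information-retrieval metric, not something the paper defines):

```python
def precision_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the top-k retrieved motions that are relevant to the task."""
    relevant = set(relevant_ids)
    top = retrieved_ids[:k]
    return sum(1 for i in top if i in relevant) / k

# Toy query: motions 1, 2, and 3 actually satisfy the difficult constraint.
p2 = precision_at_k([4, 1, 7, 2], relevant_ids=[1, 2, 3], k=2)
```

Reporting this per constraint family would directly address point (i) about dataset coverage.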

Circularity Check

0 steps flagged

No circularity: procedural pipeline without derivations or self-referential reductions

full rationale

The paper describes a retrieval-guided diffusion noise optimization method as a sequence of steps: LLM-based relational task parsing to identify difficult constraints, retrieval of reference motions from datasets, reward-guided masking to create improved noise initialization, and subsequent optimization. No equations, mathematical derivations, fitted parameters, or predictions are present that reduce any claim to its own inputs by construction. No self-citations are used to justify uniqueness theorems, ansatzes, or load-bearing premises. The method is self-contained as an engineering pipeline whose success depends on external dataset coverage and LLM reliability rather than internal definitional loops. This matches the default expectation of no significant circularity for descriptive procedural papers.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract describes a high-level algorithmic pipeline but contains no mathematical derivations, fitted parameters, axioms, or postulated entities.

pith-pipeline@v0.9.0 · 5486 in / 1054 out tokens · 38215 ms · 2026-05-11T02:13:47.790097+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Reference graph

Works this paper leans on

52 extracted references · 52 canonical work pages · 2 internal anchors

  1. [1]

    Sinc: Spatial composition of 3d human motions for simultaneous action generation

    Nikos Athanasiou, Mathis Petrovich, Michael J Black, and Gül Varol. Sinc: Spatial composition of 3d human motions for simultaneous action generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9984–9995, 2023.

  2. [2]

    Pose-guided motion diffusion model for text-to-motion generation

    Xinhao Cai, Minghang Zheng, Qingchao Chen, Yuxin Peng, and Yang Liu. Pose-guided motion diffusion model for text-to-motion generation.

  3. [3]

    Masactrl: Tuning-free mutual self-attention control for consistent image synthesis and editing

    Mingdeng Cao, Xintao Wang, Zhongang Qi, Ying Shan, Xiaohu Qie, and Yinqiang Zheng. Masactrl: Tuning-free mutual self-attention control for consistent image synthesis and editing. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 22560–22570, 2023.

  4. [4]

    Executing your commands via motion diffusion in latent space

    Xin Chen, Biao Jiang, Wen Liu, Zilong Huang, Bin Fu, Tao Chen, and Gang Yu. Executing your commands via motion diffusion in latent space. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18000–18010, 2023.

  5. [5]

    Gmt: General motion tracking for humanoid whole-body control

    Zixuan Chen, Mazeyu Ji, Xuxin Cheng, Xuanbin Peng, Xue Bin Peng, and Xiaolong Wang. Gmt: General motion tracking for humanoid whole-body control. arXiv:2506.14770, 2025.

  6. [6]

    Reno: Enhancing one-step text-to-image models through reward-based noise optimization

    Luca Eyring, Shyamgopal Karthik, Karsten Roth, Alexey Dosovitskiy, and Zeynep Akata. Reno: Enhancing one-step text-to-image models through reward-based noise optimization. Advances in Neural Information Processing Systems, 37:125487–125519, 2024.

  7. [7]

    Go to zero: Towards zero-shot motion generation with million-scale data

    Ke Fan, Shunlin Lu, Minyue Dai, Runyi Yu, Lixing Xiao, Zhiyang Dou, Junting Dong, Lizhuang Ma, and Jingbo Wang. Go to zero: Towards zero-shot motion generation with million-scale data. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 13336–13348, 2025.

  8. [8]

    Generating diverse and natural 3d human motions from text

    Chuan Guo, Shihao Zou, Xinxin Zuo, Sen Wang, Wei Ji, Xingyu Li, and Li Cheng. Generating diverse and natural 3d human motions from text. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5152–5161, 2022.

  9. [9]

    Momask: Generative masked modeling of 3d human motions

    Chuan Guo, Yuxuan Mu, Muhammad Gohar Javed, Sen Wang, and Li Cheng. Momask: Generative masked modeling of 3d human motions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1900–1910, 2024.

  10. [10]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025.

  11. [11]

    Initno: Boosting text-to-image diffusion models via initial noise optimization

    Xiefan Guo, Jinlin Liu, Miaomiao Cui, Jiankai Li, Hongyu Yang, and Di Huang. Initno: Boosting text-to-image diffusion models via initial noise optimization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9380–9389, 2024.

  12. [12]

    Atom: Aligning text-to-motion model at event-level with gpt-4vision reward

    Haonan Han, Xiangzuo Wu, Huan Liao, Zunnan Xu, Zhongyuan Hu, Ronghui Li, Yachao Zhang, and Xiu Li. Atom: Aligning text-to-motion model at event-level with gpt-4vision reward. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 22746–22755,

  13. [13]

    A causal convolutional neural network for multi-subject motion modeling and generation

    Shuaiying Hou, Congyi Wang, Wenlin Zhuang, Yu Chen, Yangang Wang, Hujun Bao, Jinxiang Chai, and Weiwei Xu. A causal convolutional neural network for multi-subject motion modeling and generation. Computational Visual Media, 10(1):45–59, 2024.

  14. [14]

    Como: Controllable motion generation through language guided pose code editing

    Yiming Huang, Weilin Wan, Yue Yang, Chris Callison-Burch, Mark Yatskar, and Lingjie Liu. Como: Controllable motion generation through language guided pose code editing. In European Conference on Computer Vision, pages 180–196. Springer, 2024.

  15. [15]

    Motiongpt: Human motion as a foreign language

    Biao Jiang, Xin Chen, Wen Liu, Jingyi Yu, Gang Yu, and Tao Chen. Motiongpt: Human motion as a foreign language. Advances in Neural Information Processing Systems, 36:20067–20079, 2023.

  16. [16]

    Guided motion diffusion for controllable human motion synthesis

    Korrawe Karunratanakul, Konpat Preechakul, Supasorn Suwajanakorn, and Siyu Tang. Guided motion diffusion for controllable human motion synthesis. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2151–2162, 2023.

  17. [17]

    Optimizing diffusion noise can serve as universal motion priors

    Korrawe Karunratanakul, Konpat Preechakul, Emre Aksan, Thabo Beeler, Supasorn Suwajanakorn, and Siyu Tang. Optimizing diffusion noise can serve as universal motion priors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1334–1345, 2024.

  18. [18]

    Adam: A Method for Stochastic Optimization

    Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

  19. [19]

    Countcluster: Training-free object quantity guidance with cross-attention map clustering for text-to-image generation

    Joohyeon Lee, Jin-Seop Lee, and Jee-Hyong Lee. Countcluster: Training-free object quantity guidance with cross-attention map clustering for text-to-image generation. arXiv preprint arXiv:2508.10710, 2025.

  20. [20]

    Example-based motion synthesis via generative motion matching

    Weiyu Li, Xuelin Chen, Peizhuo Li, Olga Sorkine-Hornung, and Baoquan Chen. Example-based motion synthesis via generative motion matching. ACM Transactions on Graphics (TOG), 42(4):1–12, 2023.

  21. [21]

    Simmotionedit: Text-based human motion editing with motion similarity prediction

    Zhengyuan Li, Kai Cheng, Anindita Ghosh, Uttaran Bhattacharya, Liangyan Gui, and Aniket Bera. Simmotionedit: Text-based human motion editing with motion similarity prediction. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 27827–27837, 2025.

  22. [22]

    Remomask: Retrieval-augmented masked motion generation

    Zhengdao Li, Siheng Wang, Zeyu Zhang, and Hao Tang. Remomask: Retrieval-augmented masked motion generation. arXiv preprint arXiv:2508.02605, 2025.

  23. [23]

    Rmd: A simple baseline for more general human motion generation via training-free retrieval-augmented motion diffuse

    Zhouyingcheng Liao, Mingyuan Zhang, Wenjia Wang, Lei Yang, and Taku Komura. Rmd: A simple baseline for more general human motion generation via training-free retrieval-augmented motion diffuse. arXiv preprint arXiv:2412.04343, 2024.

  24. [24]

    Programmable motion generation for open-set motion control tasks

    Hanchao Liu, Xiaohang Zhan, Shaoli Huang, Tai-Jiang Mu, and Ying Shan. Programmable motion generation for open-set motion control tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1399–1408, 2024.

  25. [25]

    Smpl: A skinned multi-person linear model

    Matthew Loper, Naureen Mahmood, Javier Romero, Gerard Pons-Moll, and Michael J Black. Smpl: A skinned multi-person linear model. ACM Transactions on Graphics, 34(6), 2015.

  26. [26]

    Amass: Archive of motion capture as surface shapes

    Naureen Mahmood, Nima Ghorbani, Nikolaus F Troje, Gerard Pons-Moll, and Michael J Black. Amass: Archive of motion capture as surface shapes. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5442–5451, 2019.

  27. [27]

    Generation of complex 3d human motion by temporal and spatial composition of diffusion models

    Lorenzo Mandelli and Stefano Berretti. Generation of complex 3d human motion by temporal and spatial composition of diffusion models. In 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 1279–

  28. [28]

    Noise diffusion for enhancing semantic faithfulness in text-to-image synthesis

    Boming Miao, Chunxiao Li, Xiaoxiao Wang, Andi Zhang, Rui Sun, Zizhe Wang, and Yao Zhu. Noise diffusion for enhancing semantic faithfulness in text-to-image synthesis. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 23575–23584, 2025.

  29. [29]

    Tokenhsi: Unified synthesis of physical human-scene interactions through task tokenization

    Liang Pan, Zeshi Yang, Zhiyang Dou, Wenjia Wang, Buzhen Huang, Bo Dai, Taku Komura, and Jingbo Wang. Tokenhsi: Unified synthesis of physical human-scene interactions through task tokenization. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 5379–5391, 2025.

  30. [30]

    Tmr: Text-to-motion retrieval using contrastive 3d human motion synthesis

    Mathis Petrovich, Michael J Black, and Gül Varol. Tmr: Text-to-motion retrieval using contrastive 3d human motion synthesis. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9488–9497, 2023.

  31. [31]

    Multi-track timeline control for text-driven 3d human motion generation

    Mathis Petrovich, Or Litany, Umar Iqbal, Michael J Black, Gul Varol, Xue Bin Peng, and Davis Rempe. Multi-track timeline control for text-driven 3d human motion generation. In 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 1911–

  32. [32]

    IEEE Computer Society, 2024.

  33. [33]

    Autoedit: Automatic hyperparameter tuning for image editing

    Chau Pham, Quan Dao, Mahesh Bhosale, Yunjie Tian, Dimitris Metaxas, and David Doermann. Autoedit: Automatic hyperparameter tuning for image editing. arXiv preprint arXiv:2509.15031, 2025.

  34. [34]

    Maskcontrol: Spatio-temporal control for masked motion synthesis

    Ekkasit Pinyoanuntapong, Muhammad Saleem, Korrawe Karunratanakul, Pu Wang, Hongfei Xue, Chen Chen, Chuan Guo, Junli Cao, Jian Ren, and Sergey Tulyakov. Maskcontrol: Spatio-temporal control for masked motion synthesis. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 9955–9965, 2025.

  35. [35]

    Not all noises are created equally: Diffusion noise selection and optimization

    Zipeng Qi, Lichen Bai, Haoyi Xiong, and Zeke Xie. Not all noises are created equally: Diffusion noise selection and optimization. arXiv preprint arXiv:2407.14041, 2024.

  36. [36]

    Human motion diffusion as a generative prior

    Yoni Shafir, Guy Tevet, Roy Kapon, and Amit Haim Bermano. Human motion diffusion as a generative prior. In The Twelfth International Conference on Learning Representations, 2023.

  37. [37]

    Denoising diffusion implicit models

    Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. In International Conference on Learning Representations, 2020.

  38. [38]

    Sopo: Text-to-motion generation using semi-online preference optimization

    Xiaofeng Tan, Hongsong Wang, Xin Geng, and Pan Zhou. Sopo: Text-to-motion generation using semi-online preference optimization. In Advances in Neural Information Processing Systems, 2025.

  39. [39]

    Human motion diffusion model

    Guy Tevet, Sigal Raab, Brian Gordon, Yoni Shafir, Daniel Cohen-or, and Amit Haim Bermano. Human motion diffusion model. In The Eleventh International Conference on Learning Representations, 2022.

  40. [40]

    Tlcontrol: Trajectory and language control for human motion synthesis

    Weilin Wan, Zhiyang Dou, Taku Komura, Wenping Wang, Dinesh Jayaraman, and Lingjie Liu. Tlcontrol: Trajectory and language control for human motion synthesis. In European Conference on Computer Vision, pages 37–54. Springer, 2024.

  41. [41]

    Diffusion models for 3d generation: A survey

    Chen Wang, Hao-Yang Peng, Ying-Tian Liu, Jiatao Gu, and Shi-Min Hu. Diffusion models for 3d generation: A survey. Computational Visual Media, 11(1):1–28, 2025.

  42. [42]

    Aligning human motion generation with human perceptions

    Haoru Wang, Wentao Zhu, Luyi Miao, Yishu Xu, Feng Gao, Qi Tian, and Yizhou Wang. Aligning human motion generation with human perceptions. In The Thirteenth International Conference on Learning Representations, 2025.

  43. [43]

    Sims: Simulating stylized human-scene interactions with retrieval-augmented script generation

    Wenjia Wang, Liang Pan, Zhiyang Dou, Jidong Mei, Zhouyingcheng Liao, Yuke Lou, Yifan Wu, Lei Yang, Jingbo Wang, and Taku Komura. Sims: Simulating stylized human-scene interactions with retrieval-augmented script generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 14117–14127, 2025.

  44. [44]

    Cannyedit: Selective canny control and dual-prompt guidance for training-free image editing

    Weiyan Xie, Han Gao, Didan Deng, Kaican Li, April Hua Liu, Yongxiang Huang, and Nevin L Zhang. Cannyedit: Selective canny control and dual-prompt guidance for training-free image editing. arXiv preprint arXiv:2508.06937, 2025.

  45. [45]

    Omnicontrol: Control any joint at any time for human motion generation

    Yiming Xie, Varun Jampani, Lei Zhong, Deqing Sun, and Huaizu Jiang. Omnicontrol: Control any joint at any time for human motion generation. In The Twelfth International Conference on Learning Representations, 2023.

  46. [46]

    Vimorag: Video-based retrieval-augmented 3d motion generation for motion language models

    Haidong Xu, Guangwei Xu, Zhedong Zheng, Xiatian Zhu, Wei Ji, Xiangtai Li, Ruijie Guo, Meishan Zhang, Hao Fei, et al. Vimorag: Video-based retrieval-augmented 3d motion generation for motion language models. arXiv preprint arXiv:2508.12081, 2025.

  47. [47]

    Good seed makes a good crop: Discovering secret seeds in text-to-image diffusion models

    Katherine Xu, Lingzhi Zhang, and Jianbo Shi. Good seed makes a good crop: Discovering secret seeds in text-to-image diffusion models. In 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 3024–3034. IEEE, 2025.

  48. [48]

    Humanvla: Towards vision-language directed object rearrangement by physical humanoid

    Xinyu Xu, Yizheng Zhang, Yong-Lu Li, Lei Han, and Cewu Lu. Humanvla: Towards vision-language directed object rearrangement by physical humanoid. Advances in Neural Information Processing Systems, 37:18633–18659, 2024.

  49. [49]

    Remodiffuse: Retrieval-augmented motion diffusion model

    Mingyuan Zhang, Xinying Guo, Liang Pan, Zhongang Cai, Fangzhou Hong, Huirong Li, Lei Yang, and Ziwei Liu. Remodiffuse: Retrieval-augmented motion diffusion model. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 364–373, 2023.

  50. [50]

    Motiondiffuse: Text-driven human motion generation with diffusion model

    Mingyuan Zhang, Zhongang Cai, Liang Pan, Fangzhou Hong, Xinying Guo, Lei Yang, and Ziwei Liu. Motiondiffuse: Text-driven human motion generation with diffusion model. IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(6):4115–4128, 2024.

  51. [51]

    Rohm: Robust human motion reconstruction via diffusion

    Siwei Zhang, Bharat Lal Bhatnagar, Yuanlu Xu, Alexander Winkler, Petr Kadlecek, Siyu Tang, and Federica Bogo. Rohm: Robust human motion reconstruction via diffusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14606–14617, 2024.

  52. [52]

    Dartcontrol: A diffusion-based autoregressive motion model for real-time text-driven motion control

    Kaifeng Zhao, Gen Li, and Siyu Tang. Dartcontrol: A diffusion-based autoregressive motion model for real-time text-driven motion control. In The Thirteenth International Conference on Learning Representations, 2025.