Plan, Don't Pose: Long Composite Motion Generation with Text-Aligned BFM

Anton Bozhedarov; Dmitry V. Dylov; Maksim Bobrin; Nazar Buzun; Nikolay Shvetsov

arxiv: 2605.29906 · v2 · pith:WT2WGZT4new · submitted 2026-05-28 · 💻 cs.LG

Nikolay Shvetsov, Maksim Bobrin, Nazar Buzun, Anton Bozhedarov, Dmitry V. Dylov

Pith reviewed 2026-06-29 08:48 UTC · model grok-4.3

classification 💻 cs.LG

keywords text-to-motion generation · behavioral foundation models · variational behavioral bottleneck · latent policy space · compositional motion · motion planning · frozen pretrained models

The pith

Text2BFM aligns language with frozen behavioral foundation models in a compressed latent manifold to generate long composite motions without direct pose synthesis.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that text-to-motion generation improves when semantic planning occurs separately from motion execution by operating inside the latent policy space of a pretrained behavioral foundation model. A variational bottleneck aligns these internal representations with natural language while keeping long-horizon behavioral structure intact, allowing a lightweight generator to produce motions that the frozen model then executes. Existing direct methods couple interpretation, planning, and physical realization in one model, which becomes costly and unreliable for extended or detailed prompts. By freezing the foundation model and adding only the bottleneck plus generator, the framework reduces training demands and improves robustness on compositional descriptions. This separation makes the approach practical for applications needing coherent sequences over many steps.

Core claim

Text2BFM is the first framework that aligns natural language with pretrained Behavioral Foundation Models for T2M generation without relying on heavy end-to-end motion generators. It operates in the latent policy space of a frozen BFM, using a text-aligned variational behavioral bottleneck to compress policy-latent sequences into compact motion representations that remain compatible with language and preserve long-horizon behavioral structure. Generation occurs in this compact behavioral manifold with a lightweight conditional generator, after which the resulting latent encoded behaviors are decoded into policy latents that drive the pretrained frozen BFM.
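The data flow described above (text, then a compact behavioral plan, then policy latents, then execution by the frozen model) can be sketched in a few lines. This is a toy illustration under stated assumptions, not the paper's implementation: every name, dimension, and the fixed random linear maps standing in for trained networks are hypothetical, and the "frozen BFM" here is a placeholder that only counts control steps.

```python
import random

# Illustrative sizes only; none of these values come from the paper.
TEXT_DIM, COMPACT_DIM, POLICY_DIM = 8, 4, 16
NUM_SEGMENTS, STEPS_PER_SEGMENT = 3, 5


def embed_text(prompt):
    """Stand-in text encoder: a deterministic pseudo-embedding of the prompt."""
    rng = random.Random(prompt)
    return [rng.uniform(-1.0, 1.0) for _ in range(TEXT_DIM)]


def make_linear(in_dim, out_dim, seed):
    """Fixed random linear map, standing in for a trained network (assumption)."""
    rng = random.Random(seed)
    weights = [[rng.gauss(0.0, 0.3) for _ in range(in_dim)] for _ in range(out_dim)]
    return lambda x: [sum(w * v for w, v in zip(row, x)) for row in weights]


generator = make_linear(TEXT_DIM + 1, COMPACT_DIM, seed=0)  # text -> compact latent
decoder = make_linear(COMPACT_DIM, POLICY_DIM, seed=1)      # compact -> policy latent


def plan_in_compact_space(text_emb):
    """Planning step: one compact latent per segment, conditioned on the text."""
    return [generator(text_emb + [float(k)]) for k in range(NUM_SEGMENTS)]


def frozen_bfm_rollout(policy_latents):
    """Placeholder for the frozen BFM: each policy latent drives a few control steps."""
    steps = []
    for segment, z in enumerate(policy_latents):
        for t in range(STEPS_PER_SEGMENT):
            steps.append((segment, t, len(z)))  # a real BFM would emit simulated poses
    return steps


def text_to_motion(prompt):
    compact_plan = plan_in_compact_space(embed_text(prompt))
    policy_latents = [decoder(m) for m in compact_plan]
    return compact_plan, policy_latents, frozen_bfm_rollout(policy_latents)


plan, latents, motion = text_to_motion("walk forward, then sit down")
print(len(plan), len(latents[0]), len(motion))  # prints: 3 16 15
```

The point of the sketch is the separation of concerns: only `generator` and `decoder` would be trained, while the execution stage is treated as a fixed black box.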

What carries the argument

The text-aligned variational behavioral bottleneck, which compresses BFM policy-latent sequences into compact, language-compatible motion representations while preserving long-horizon behavioral structure.

If this is right

Text2BFM achieves efficient and robust T2M generation by keeping the foundation model frozen.
The method delivers strong performance on long, compositional textual descriptions.
Semantic planning is decoupled from low-level motion execution.
The frozen BFM serves as an executable motion prior without retraining.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Behavioral foundation models could be reused across multiple language-conditioned tasks by swapping only the bottleneck and generator.
The same compression approach might extend to other sequential control domains where long-horizon structure must align with external instructions.
Treating motion planning as a separate semantic compression step could reduce the need for ever-larger end-to-end models in animation and robotics.

Load-bearing premise

A variational bottleneck can compress BFM policy-latent sequences into compact representations that stay compatible with language while preserving the essential long-horizon behavioral structure.

What would settle it

Showing that motions produced for long compositional prompts either fail to follow the described action sequence or perform no better in coherence and efficiency than direct end-to-end pose generators.

Figures

Figures reproduced from arXiv: 2605.29906 by Anton Bozhedarov, Dmitry V. Dylov, Maksim Bobrin, Nazar Buzun, Nikolay Shvetsov.

**Figure 2.** Text2BFM method and its principal diagram components. Shown are the training (steps 1 ...

**Figure 3.** Comparison of methods for compositional motion generation. The text prompt is “A person ...

**Figure 4.** Hierarchical visualization of motion categories in the dataset. The dataset was constructed ...

Original abstract

Text-to-motion (T2M) generation has broad applications in character animation, virtual avatars, and human-robot interaction. Existing methods typically generate pose trajectories or motion tokens directly from language, forcing a single model to handle semantic interpretation, long-horizon structure, and low-level physical realization. This coupling makes them costly and often unreliable for long, compositional, or semantically dense prompts. We propose Text2BFM, the first framework that aligns natural language with pretrained Behavioral Foundation Models (BFMs) for T2M generation without relying on heavy end-to-end motion generators. Text2BFM operates in the latent policy space of a frozen BFM, using it as an executable motion prior. A text-aligned variational behavioral bottleneck compresses BFM policy-latent sequences into compact motion representations that are compatible with language and preserve long-horizon behavioral structure. Generation is performed in this compact behavioral manifold with a lightweight conditional generator, and the resulting latent encoded behaviors are decoded into policy latents that drive the pretrained frozen BFM. By decoupling semantic planning from motion execution, Text2BFM achieves efficient, robust T2M generation and strong performance on long, compositional textual descriptions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Text2BFM decouples semantic planning from BFM execution via a text-aligned variational bottleneck, but the abstract supplies no equations, metrics, or implementation details to check whether the claims hold.

Hi,

The main point is that this paper introduces Text2BFM, which works in the latent policy space of a frozen BFM and uses a text-aligned variational behavioral bottleneck to create compact, language-compatible motion representations for long composite prompts.

What is new is the specific alignment of natural language with pretrained BFMs through that bottleneck, presented as the first such framework. The approach treats the BFM as an executable prior and runs a lightweight generator in the resulting manifold before decoding back to policy latents. This modular split addresses the problem that single end-to-end models struggle with semantic interpretation, long-horizon structure, and low-level realization at once.

The paper does a reasonable job laying out why direct generation methods become costly and unreliable on dense prompts and why operating in a compressed behavioral manifold could improve efficiency and robustness.

The soft spots are straightforward: only the abstract is available, so there are no equations, training objectives, ablation studies, or quantitative results to evaluate. It is impossible to tell whether the bottleneck actually preserves long-horizon structure or whether the claimed performance on compositional text is real. The central assumption that the compressed representations remain compatible with language while keeping behavioral fidelity is stated but untested in what we have.

This is aimed at researchers working on text-to-motion, hierarchical control, or modular animation pipelines. A reader already thinking about latent-space planning for robotics or avatars could extract the architectural pattern.

It deserves peer review because the idea is internally consistent and targets a genuine limitation in current T2M work, even though the current write-up is high-level. Once the full details and experiments are in, a referee can assess whether the bottleneck delivers.

Cheers

Referee Report

1 major / 1 minor

Summary. The manuscript proposes Text2BFM, the first framework to align natural language with pretrained Behavioral Foundation Models (BFMs) for text-to-motion (T2M) generation. It operates in the latent policy space of a frozen BFM as an executable prior, employing a text-aligned variational behavioral bottleneck to compress policy-latent sequences into compact, language-compatible motion representations that preserve long-horizon structure. A lightweight conditional generator performs planning in this manifold, with latents decoded to drive the BFM, thereby decoupling semantic planning from low-level motion execution for improved efficiency and robustness on long, compositional prompts.

Significance. If the text-aligned variational behavioral bottleneck demonstrably preserves long-horizon behavioral structure while maintaining language compatibility, the approach could offer a meaningful contribution to T2M by reducing reliance on end-to-end generators and leveraging frozen BFMs for scalable, robust generation of complex motions. The latent-planning pattern is standard but its application here to BFMs could enable more efficient handling of compositional text if empirically validated.

major comments (1)

Abstract (paragraph describing the framework): The central claim that the text-aligned variational behavioral bottleneck 'compresses BFM policy-latent sequences into compact motion representations that are compatible with language and preserve long-horizon behavioral structure' is load-bearing yet unsupported by any equations, implementation details, metrics, ablations, or quantitative results in the manuscript, preventing evaluation of whether the compression actually achieves the stated preservation.

minor comments (1)

Abstract: The term 'Behavioral Foundation Models (BFMs)' is introduced without a reference or prior definition, which may confuse readers unfamiliar with the specific pretrained models used.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their review and the identification of a clarity issue in the abstract. We address the major comment below.

Referee: Abstract (paragraph describing the framework): The central claim that the text-aligned variational behavioral bottleneck 'compresses BFM policy-latent sequences into compact motion representations that are compatible with language and preserve long-horizon behavioral structure' is load-bearing yet unsupported by any equations, implementation details, metrics, ablations, or quantitative results in the manuscript, preventing evaluation of whether the compression actually achieves the stated preservation.

Authors: The full manuscript contains the supporting material in Section 3 (Method), which derives the text-aligned variational behavioral bottleneck via an evidence lower bound that jointly enforces compression of policy-latent sequences and alignment to language embeddings, along with the decoding step that recovers executable behaviors from the frozen BFM. Section 4 reports quantitative ablations on bottleneck capacity, language-alignment metrics, and long-horizon coherence scores that directly evaluate preservation of behavioral structure. To make this support immediately visible from the abstract, we will insert a concise parenthetical reference to these sections and the key variational objective in the revised manuscript.

Revision: yes
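The response above refers to an evidence lower bound that jointly enforces compression and language alignment, but this page does not reproduce it. Purely as an illustration of what such an objective could look like, here is a generic β-weighted variational bound with an added alignment term. Every symbol is an assumption introduced for this sketch: $z_{1:T}$ for the BFM policy-latent sequence, $m$ for the compact latent, $c$ for the text embedding, $\beta$ and $\lambda$ for weights, and $\mathcal{L}_{\text{align}}$ for an unspecified alignment loss. The paper's actual objective may differ.

```latex
\mathcal{L}(\theta, \phi)
  = \underbrace{\mathbb{E}_{q_\phi(m \mid z_{1:T})}\!\left[ -\log p_\theta(z_{1:T} \mid m) \right]}_{\text{reconstruction of policy latents}}
  + \beta \, \underbrace{D_{\mathrm{KL}}\!\left( q_\phi(m \mid z_{1:T}) \,\|\, p(m) \right)}_{\text{compression}}
  + \lambda \, \underbrace{\mathcal{L}_{\text{align}}(m, c)}_{\text{language alignment}}
```

With $\lambda = 0$ and $\beta = 1$ this reduces to the standard negative ELBO of a variational autoencoder; the extra term is where a text-alignment constraint would enter.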

Circularity Check

0 steps flagged

No significant circularity in derivation chain

The abstract and framework description introduce Text2BFM as an architectural decoupling of semantic planning from motion execution via a text-aligned variational behavioral bottleneck operating on frozen BFM latents. No equations, fitted parameters, or predictions are presented that reduce by construction to inputs. No self-citations, uniqueness theorems, or ansatzes are invoked as load-bearing premises. The central claim is a design pattern (latent-space planning) with no internal reduction to self-defined quantities or self-referential citations. This is self-contained against external benchmarks and receives the default non-finding.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

Only abstract available; ledger populated from stated components.

axioms (1)

Domain assumption: Pretrained Behavioral Foundation Models exist and can serve as frozen executable motion priors.
The entire pipeline depends on the availability and quality of such BFMs.

invented entities (1)

Text-aligned variational behavioral bottleneck (no independent evidence)
purpose: Compress BFM policy-latent sequences into compact, language-compatible motion representations while preserving long-horizon structure.
New component introduced to enable the decoupling.

pith-pipeline@v0.9.1-grok · 5754 in / 1206 out tokens · 27012 ms · 2026-06-29T08:48:55.835230+00:00 · methodology

A. Technical Details and Hyperparameters

Our method is trained in two stages: 1) semantic laten...

The latent sequence m has a stochastic posterior parameterization (µ, log σ²) and reparameterized sampling.

Stage 2: Text-to-latent generator. A pretrained FB backbone is adapted to predict latent motion sequences conditioned on text. Optimization uses composite generation/reconstruction losses (including consistency in both m and z spaces).

Compute resources....
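The excerpt above mentions a stochastic posterior parameterized by (µ, log σ²) with reparameterized sampling. A minimal sketch of that standard technique follows, using only the Python standard library. The function names, the diagonal-Gaussian assumption, and the standard-normal prior in the KL term are illustrative choices, not details confirmed by the paper.

```python
import math
import random


def reparameterize(mu, log_var, rng):
    """Draw m = mu + sigma * eps with eps ~ N(0, 1), where sigma = exp(0.5 * log_var).

    Writing the sample this way keeps it a deterministic function of
    (mu, log_var) given the noise, which is what lets gradients flow
    through the sampling step in an autodiff framework.
    """
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, log_var)]


def kl_to_standard_normal(mu, log_var):
    """Closed-form KL divergence from N(mu, diag(sigma^2)) to N(0, I)."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, log_var))


# Hypothetical posterior parameters for a 2-dimensional latent.
rng = random.Random(0)
mu, log_var = [0.5, -1.0], [0.0, -2.0]
sample = reparameterize(mu, log_var, rng)
print(len(sample), kl_to_standard_normal(mu, log_var) >= 0.0)
```

Predicting log σ² rather than σ keeps the variance positive without a constraint, which is the usual reason for this parameterization.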

[1] [1]

Proto successor measure: Representing the behavior space of an RL agent

Siddhant Agarwal, Harshit Sikchi, Peter Stone, and Amy Zhang. Proto successor measure: Representing the behavior space of an RL agent. InF orty-second International Conference on Machine Learning, 2025. URLhttps://openreview.net/forum?id=mUDnPzopZF

2025

[2] [2]

Language2pose: Natural language grounded pose forecasting

Chaitanya Ahuja and Louis-Philippe Morency. Language2pose: Natural language grounded pose forecasting. In2019 International conference on 3D vision (3DV), pages 719–728. IEEE, 2019

2019

[3] [3]

Albergo, Nicholas M

Michael S. Albergo, Nicholas M. Boffi, and Eric Vanden-Eijnden. Building normalizing flows with stochastic interpolants. InInternational Conference on Learning Representations, 2023

2023

[4] [4]

Hunt, Tom Schaul, Hado van Hasselt, and David Silver

André Barreto, Will Dabney, Rémi Munos, Jonathan J. Hunt, Tom Schaul, Hado van Hasselt, and David Silver. Successor features for transfer in reinforcement learning. InAdvances in Neural Information Processing Systems, volume 30, 2017

2017

[5] [5]

Learning successor states and goal-dependent values: A mathematical viewpoint.arXiv preprint arXiv:2101.07123, 2021

Léonard Blier, Corentin Tallec, and Yann Ollivier. Learning successor states and goal-dependent values: A mathematical viewpoint.arXiv preprint arXiv:2101.07123, 2021

work page arXiv 2021

[6] [6]

Maksim Bobrin, Ilya Zisman, Alexander Nikulin, Vladislav Kurenkov, and Dmitry V . Dylov. Zero-shot adaptation of behavioral foundation models to unseen dynamics. InThe F ourteenth International Conference on Learning Representations, 2026. URL https://openreview. net/forum?id=dBDBg4WF4F

2026

[7] [7]

Executing your commands via motion diffusion in latent space

Xin Chen, Biao Jiang, Wen Liu, Zilong Huang, Bin Fu, Tao Chen, and Gang Yu. Executing your commands via motion diffusion in latent space. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18000–18010, 2023

2023

[8] [8]

Improving generalization for temporal difference learning: The successor repre- sentation.Neural Computation, 5(4):613–624, 1993

Peter Dayan. Improving generalization for temporal difference learning: The successor repre- sentation.Neural Computation, 5(4):613–624, 1993

1993

[9] [9]

Go to zero: Towards zero-shot motion generation with million-scale data

Ke Fan, Shunlin Lu, Minyue Dai, Runyi Yu, Lixing Xiao, Zhiyang Dou, Junting Dong, Lizhuang Ma, and Jingbo Wang. Go to zero: Towards zero-shot motion generation with million-scale data. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 13336–13348, 2025

2025

[10] [10]

Syn- thesis of compositional animations from textual descriptions

Anindita Ghosh, Noshaba Cheema, Cennet Oguz, Christian Theobalt, and Philipp Slusallek. Syn- thesis of compositional animations from textual descriptions. InProceedings of the IEEE/CVF international conference on computer vision, pages 1396–1406, 2021

2021

[11] [11]

Hu- manML3D: A large and diverse 3d human motion-language dataset

Chuan Guo, Shihao Zou, Xinxin Zuo, Sen Wang, Wei Ji, Xingyu Li, and Li Cheng. Hu- manML3D: A large and diverse 3d human motion-language dataset. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5151–5160, 2022

2022

[12] [12]

Tm2t: Stochastic and tokenized modeling for the reciprocal generation of 3d human motions and texts

Chuan Guo, Xinxin Zuo, Sen Wang, and Li Cheng. Tm2t: Stochastic and tokenized modeling for the reciprocal generation of 3d human motions and texts. InEuropean Conference on Computer Vision, pages 580–597. Springer, 2022

2022

[13] [13]

Momask: Generative masked modeling of 3d human motions

Chuan Guo, Yuxuan Mu, Muhammad Gohar Javed, Sen Wang, and Li Cheng. Momask: Generative masked modeling of 3d human motions. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1900–1910, 2024

1900

[14] [14]

Kingma and Max Welling

Diederik P. Kingma and Max Welling. Auto-encoding variational bayes. InInternational Conference on Learning Representations, 2014

2014

[15] Yitang Li, Zhengyi Luo, Tonghe Zhang, Cunxi Dai, Anssi Kanervisto, Andrea Tirinzoni, Haoyang Weng, Kris Kitani, Mateusz Guzek, Ahmed Touati, Alessandro Lazaric, Matteo Pirotta, and Guanya Shi. BFM-zero: A promptable behavioral foundation model for humanoid control using unsupervised reinforcement learning. In The Fourteenth International Conference on Lea... 2026.

[16] Zeyu Ling, Shunlin Lu, Yuhong Zhang, et al. Motionllama: A unified framework for motion synthesis and comprehension. arXiv preprint arXiv:2411.17335, 2024.

[17] Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. arXiv preprint arXiv:2210.02747, 2023.

[18] Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow. In International Conference on Learning Representations, 2023.

[19] Zhengyi Luo, Jinkun Cao, Kris Kitani, Weipeng Xu, et al. Perpetual humanoid control for real-time simulated avatars. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10895–10904, 2023.

[20] Ziyi Luo, Hongwen Yang, Xiaogang Wang, and Ziwei Liu. Hierarchical motion generation with diffusion transformers. In Advances in Neural Information Processing Systems, 2024.

[21] Naureen Mahmood, Nima Ghorbani, Nikolaus F. Troje, Gerard Pons-Moll, and Michael J. Black. AMASS: Archive of motion capture as surface shapes. In International Conference on Computer Vision, pages 5442–5451, 2019.

[22] Shenghao Mo, Junting Zhang, Yuxiao Guo, Jingbo Wang, and Qifeng Liu. MoMask: Hierarchical masked 3d human motion generation. arXiv preprint arXiv:2312.04561, 2023.

[23] William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4195–4205, 2023.

[24] Xue Bin Peng, Ze Ma, Pieter Abbeel, Sergey Levine, and Angjoo Kanazawa. AMP: Adversarial motion priors for stylized physics-based character control. ACM Transactions on Graphics, 40(4):144:1–144:20, 2021.

[25] Xue Bin Peng, Yunrong Guo, Lina Halper, Sergey Levine, and Sanja Fidler. ASE: Large-scale reusable adversarial skill embeddings for physically simulated characters. ACM Transactions on Graphics, 41(4):94:1–94:17, 2022.

[26] Mathis Petrovich, Michael J. Black, and Gül Varol. TEMOS: Generating diverse human motions from textual descriptions. In European Conference on Computer Vision, pages 480–

[27] Mathis Petrovich, Michael J. Black, and Gül Varol. TMR: Text-to-motion retrieval using contrastive 3d human motion synthesis. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9488–9497, 2023.

[28] Matthias Plappert, Christian Mandery, and Tamim Asfour. The KIT motion-language dataset. Big Data, 4(4):236–252, 2016. doi: 10.1089/big.2016.0028.

[29] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.

[30] Davis Rempe, Tolga Birdal, Aaron Hertzmann, Jimei Yang, Srinath Sridhar, and Leonidas J. Guibas. HuMoR: 3d human motion model for robust pose estimation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 11488–11499, 2021.

[31] Davis Rempe, Mathis Petrovich, Ye Yuan, Haotian Zhang, Xue Bin Peng, Yifeng Jiang, Tingwu Wang, Umar Iqbal, David Minor, Michael de Ruyter, Jiefeng Li, Chen Tessler, Edy Lim, Eugene Jeong, Sam Wu, Ehsan Hassani, Michael Huang, Jin-Bey Yu, Chaeyeon Chung, Lina Song, Olivier Dionne, Jan Kautz, Simon Yuen, and Sanja Fidler. Kimodo: Scaling controllable human... arXiv, 2026.

[32] Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In International Conference on Machine Learning, pages 1278–1286, 2014.

[33] Harshit Sikchi, Siddhant Agarwal, Pranaya Jajoo, Samyak Parajuli, Caleb Chuck, Max Rudolph, Peter Stone, Amy Zhang, and Scott Niekum. RL zero: Zero-shot language to behaviors without any supervision. In 7th Robot Learning Workshop: Towards Robots with Human-Level Abilities,

[34] URL https://openreview.net/forum?id=wNvuk13MnP

[35] Chen Tessler, Yoni Kasten, Yunrong Guo, Shie Mannor, Gal Chechik, and Xue Bin Peng. CALM: Conditional adversarial latent models for directable virtual characters. In ACM SIGGRAPH Conference Proceedings, 2023.

[36] Guy Tevet, Brian Gordon, Amir Hertz, Amit H. Bermano, and Daniel Cohen-Or. MotionCLIP: Exposing human motion generation to CLIP space. In European Conference on Computer Vision, pages 358–374. Springer, 2022.

[37] Guy Tevet, Sigal Raab, Brian Gordon, Yonatan Shafir, Daniel Cohen-Or, and Amit H. Bermano. Human motion diffusion model. In International Conference on Learning Representations, 2023.

[38] Andrea Tirinzoni, Ahmed Touati, Jesse Farebrother, Mateusz Guzek, Anssi Kanervisto, Yingchen Xu, Alessandro Lazaric, and Matteo Pirotta. Zero-shot whole-body humanoid control via behavioral foundation models. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=9sOR0nYLtz

[39] Ahmed Touati and Yann Ollivier. Learning one representation to optimize all rewards. In Advances in Neural Information Processing Systems, volume 34, pages 13–24, 2021.

[40] Ahmed Touati, Jérémy Rapin, and Yann Ollivier. Does zero-shot reinforcement learning exist? In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=MYEap_OcQI

[41] Yuxin Wen, Qing Shuai, Di Kang, Jing Li, Cheng Wen, Yue Qian, Ningxin Jiao, Changhai Chen, Weijie Chen, Yiran Wang, et al. Hy-motion 1.0: Scaling flow matching models for text-to-motion generation. arXiv preprint arXiv:2512.23464, 2025.

[42] Ye Yuan, Jiaming Song, Umar Iqbal, Arash Vahdat, and Jan Kautz. PhysDiff: Physics-guided human motion diffusion model. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 16010–16021, 2023.

[43] Jianrong Zhang, Yang Zhang, Xiaodong Cun, Yong Zhang, Hongwei Zhao, Hongtao Lu, Xi Shen, and Ying Shan. Generating human motion from textual descriptions with discrete representations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14730–14740, 2023.

[44] Mingyuan Zhang, Zhongang Cai, Liang Pan, Fangzhou Hong, Xiaoyu Guo, Lei Yang, and Ziwei Liu. MotionDiffuse: Text-driven human motion generation with diffusion model. In Advances in Neural Information Processing Systems, volume 35, pages 12987–12999, 2022.

[45] Mingyuan Zhang, Xinying Guo, Liang Pan, Zhongang Cai, Fangzhou Hong, Huirong Li, Lei Yang, and Ziwei Liu. Remodiffuse: Retrieval-augmented motion diffusion model. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 364–373, 2023.

A Technical Details and Hyperparameters

Our method is trained in two stages: 1) semantic laten...

The latent sequence m has a stochastic posterior parameterization (µ, log σ²) with reparameterized sampling.

Stage 2: Text-to-latent generator. A pretrained FB backbone is adapted to predict latent motion sequences conditioned on text. Optimization uses composite generation/reconstruction losses (including consistency in both m and z spaces).

Compute resources....
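The stochastic posterior parameterization (µ, log σ²) with reparameterized sampling, described above for the latent sequence m, follows the standard variational autoencoder recipe [14, 32]. A minimal sketch in plain Python, assuming a diagonal Gaussian posterior: the function names and toy values are illustrative and not taken from the paper's code, and the KL term against a standard normal prior is an assumed regularizer that the text here does not spell out.

```python
import math
import random


def reparameterize(mu, log_var, rng):
    """Sample m = mu + sigma * eps elementwise, with eps ~ N(0, 1).

    sigma is recovered from the log-variance as exp(0.5 * log_var), so the
    randomness lives in eps and the sample stays a deterministic function
    of (mu, log_var).
    """
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


def kl_to_standard_normal(mu, log_var):
    """KL(N(mu, diag(sigma^2)) || N(0, I)), summed over dimensions.

    Assumption: a standard normal prior; the source does not state the prior.
    """
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv
                     for m, lv in zip(mu, log_var))


# Toy posterior parameters for a single 3-dimensional latent step (hypothetical).
rng = random.Random(0)
mu = [0.5, -1.0, 0.0]
log_var = [0.0, -2.0, 1.0]

sample = reparameterize(mu, log_var, rng)
kl = kl_to_standard_normal(mu, log_var)
```

Predicting log σ² rather than σ keeps the variance positive without constraining the network output, which is the usual reason for this parameterization.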