CLAW: Composable Language-Annotated Whole-body Motion Generation

Jianuo Cao; Masayoshi Tomizuka; Yuxin Chen

arxiv: 2604.11251 · v3 · submitted 2026-04-13 · 💻 cs.RO


Pith reviewed 2026-05-10 15:12 UTC · model grok-4.3

classification 💻 cs.RO

keywords whole-body motion generation · language-conditioned control · humanoid robots · motion-language datasets · physics simulation · kinematic planning · dataset generation · template annotation


The pith

CLAW generates large-scale language-annotated whole-body motion data for humanoid robots by composing kinematic primitives and simulating physical trajectories.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces CLAW as a pipeline to produce extensive paired datasets of whole-body motions and natural language descriptions needed for training language-conditioned controllers on humanoid robots such as the Unitree G1. It addresses limitations of motion capture by using composable primitives from a kinematic planner and browser interfaces for collection, then applies a low-level controller in MuJoCo simulation to create physically realistic trajectories. A template engine simultaneously generates diverse language annotations for individual segments and full sequences. This combination aims to deliver scalable, grounded data without the expense or physical infeasibility of prior approaches. A sympathetic reader would care because such datasets could enable more effective learning of language-guided humanoid behaviors at lower cost.

Core claim

CLAW is a pipeline for scalable generation of language-annotated whole-body motion data for the Unitree G1 humanoid robot. It composes motion primitives from a kinematic planner parameterized by movement, heading, speed, pelvis height, and duration, provides browser-based keyboard and timeline interfaces for data collection, tracks the references with a low-level controller in MuJoCo simulation to produce physically grounded trajectories, and uses a template-based engine to generate diverse natural-language annotations at both segment and trajectory levels.
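As a rough illustration of what a parameterized primitive and a composed sequence might look like, here is a minimal Python sketch. The class name `MotionPrimitive`, the field names, the units, and the mode labels are all assumptions made for illustration; none of them is taken from the CLAW repository.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class MotionPrimitive:
    """One segment of a composed sequence (hypothetical schema, not CLAW's)."""
    mode: str               # motion mode, e.g. "walk" or "squat" (assumed labels)
    movement: str           # movement direction, e.g. "forward" (assumed label)
    heading_deg: float      # heading in degrees (assumed unit)
    speed_mps: float        # commanded speed in m/s (assumed unit)
    pelvis_height_m: float  # target pelvis height in metres (assumed unit)
    duration_s: float       # segment duration in seconds


def total_duration(sequence: List[MotionPrimitive]) -> float:
    """Sum of segment durations for a composed trajectory."""
    return sum(p.duration_s for p in sequence)


def segment_boundaries(sequence: List[MotionPrimitive]) -> List[Tuple[float, float]]:
    """Return the (start, end) time of every segment on a shared timeline."""
    bounds, t = [], 0.0
    for p in sequence:
        bounds.append((t, t + p.duration_s))
        t += p.duration_s
    return bounds


# A two-segment sequence: walk forward, then squat in place.
sequence = [
    MotionPrimitive("walk", "forward", 0.0, 0.5, 0.75, 3.0),
    MotionPrimitive("squat", "none", 0.0, 0.0, 0.5, 2.0),
]
```

Keeping each segment as an immutable record of its parameters means the same record could feed both the planner and an annotation step, which is the pairing the pipeline relies on.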

What carries the argument

The CLAW pipeline that composes parameterized kinematic motion primitives, tracks them via low-level control in physics simulation for physical grounding, and pairs the results with template-generated language annotations.

If this is right

The system can produce motion-language pairs at scale without requiring motion-capture hardware.
The resulting data supports training of language-conditioned whole-body controllers for humanoid robots.
Browser interfaces enable both interactive exploration and batch generation of trajectories.
Physics simulation ensures trajectories respect stability and contact constraints before annotation.
Public release of the pipeline allows community extension for additional robot platforms or tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Adapting the planner and controller could extend the approach to other humanoid hardware with minimal redesign.
Replacing or augmenting templates with learned language models might increase annotation variety beyond fixed structures.
The generated pairs could serve as pretraining data for models that map language directly to low-level robot commands.

Load-bearing premise

The low-level controller in MuJoCo will successfully track the kinematic references to produce stable, physically feasible trajectories without falling or violating contact constraints, while the template engine will generate sufficiently diverse and natural language annotations at scale.

What would settle it

A high frequency of simulation rollouts in which the robot falls, violates joint limits, or loses contact stability would show that the generated trajectories are not physically grounded as claimed.
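One way such a check could be run is sketched below, under stated assumptions: each rollout is reduced to a trace of pelvis heights, and a fall is flagged when the pelvis drops below a threshold. The function names and the 0.3 m default are hypothetical and do not come from the paper. A single threshold would also need to be mode-dependent in practice, since modes such as squatting lower the pelvis on purpose.

```python
from typing import Sequence


def has_fallen(pelvis_heights: Sequence[float], min_height: float = 0.3) -> bool:
    """Flag a rollout as a fall if the pelvis ever drops below min_height.

    The 0.3 m default is an illustrative assumption, not a value from the paper.
    """
    return any(h < min_height for h in pelvis_heights)


def fall_rate(rollouts: Sequence[Sequence[float]], min_height: float = 0.3) -> float:
    """Fraction of rollouts flagged as falls (0.0 for an empty batch)."""
    if not rollouts:
        return 0.0
    falls = sum(has_fallen(r, min_height) for r in rollouts)
    return falls / len(rollouts)
```

A full test of the claim would add joint-limit and contact checks alongside this one; the sketch covers only the fall criterion.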

Figures

Figures reproduced from arXiv: 2604.11251 by Jianuo Cao, Masayoshi Tomizuka, Yuxin Chen.

**Figure 1.** Overview of CLAW. Users can compose whole-body motion sequences for the Unitree G1 humanoid through either keyboard mode (top-left) or editor mode (top-right), producing diverse trajectories with seamless transitions between motion modes (bottom-left). Each trajectory is automatically paired with multi-style natural-language descriptions (bottom-right).

**Figure 2.** Overview of the CLAW data generation pipeline. A human operator specifies motion intent via a keyboard controller or sequence editor. Structured commands are streamed to a kinematic planner whose reference motions are tracked by a whole-body controller in MuJoCo simulation. A template-based annotation engine simultaneously produces diverse natural-language descriptions from the same motion parameters, yiel…

**Figure 3.** Two complementary control interfaces of CLAW. (a) In keyboard mode, the operator controls the robot interactively using key bindings for motion mode, movement, heading, speed, pelvis height and duration. (b) In editor mode, motion clips are arranged on a visual timeline with per-segment configuration, enabling reproducible large-batch data generation.

**Figure 4.** Illustrative examples of CLAW generation capabilities. (a) The pipeline produces diverse whole-body motions spanning locomotion, squatting, boxing, and styled walking. (b) For a given motion mode, parameters such as velocity and heading can be continuously adjusted. (c) Each generated motion sequence is automatically annotated with multiple stylistically varied natural-language descriptions.

**Figure 5.** System architecture of CLAW. The pipeline comprises four decoupled processes: a MuJoCo simulation, a C++ controller hosting the kinematic planner and tracking policy, a WebSocket–ZMQ bridge for protocol translation and recipe orchestration, and a browser-based frontend for visualization and operator interaction.

**Figure 6.** Motion stitching. CLAW enables motion stitching across semantically distinct motion modes. (a) The generative planner produces smooth, natural transitions between different motion modes. (b) Abrupt mode switching might lead to transient artifacts in the generated motion.

**Figure 7.** Multi-stage trajectory. Example of automated language annotation for a long-horizon trajectory with multiple stages.

read the original abstract

Training language-conditioned whole-body controllers for humanoid robots demands large-scale motion-language datasets. Existing approaches based on motion capture are costly and limited in diversity, while text-to-motion generative models produce purely kinematic outputs that are not guaranteed to be physically feasible. We present CLAW, a pipeline for scalable generation of language-annotated whole-body motion data for the Unitree G1 humanoid robot. CLAW composes motion primitives from a kinematic planner, parameterized by movement, heading, speed, pelvis height, and duration, and provides two browser-based interfaces--a real-time keyboard mode and a timeline-based sequence editor--for exploratory and batch data collection. A low-level controller tracks these references in MuJoCo simulation, yielding physically grounded trajectories. In parallel, a template-based engine generates diverse natural-language annotations at both segment and trajectory levels. To support scalable generation of motion-language paired data for humanoid robot learning, we make our system publicly available at: https://github.com/JianuoCao/CLAW

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

CLAW is a usable open-source pipeline for generating language-annotated humanoid motion data on the Unitree G1, but it lacks the quantitative checks needed to show it actually scales well.


The paper's core contribution is a practical engineering pipeline called CLAW that composes kinematic motion primitives, lets users edit sequences through browser interfaces, tracks them in MuJoCo for physical grounding, and pairs the results with template-generated language labels at both segment and full-trajectory levels. The GitHub release is the real deliverable here, and it directly targets the data shortage for language-conditioned humanoid controllers on one specific robot platform. That combination of composable primitives, dual interfaces, simulation tracking, and annotation engine is not already out there in the cited literature, so the work is new in its integrated form even if the individual pieces are familiar techniques. Releasing the code makes the system inspectable and reusable, which is the strongest part of the submission. Researchers who need to bootstrap datasets for Unitree G1 experiments will get immediate value from the interfaces and the template engine. The description is internally consistent and the assumptions are testable by running the provided repository. The main soft spot is the absence of any numbers on tracking success rates, fall frequency, contact stability, or annotation diversity and naturalness. Without those metrics the claim that the pipeline is scalable stays unverified, and readers cannot tell how much manual cleanup the outputs still require. The scope is also narrow to a single robot, so generalization is not demonstrated. This is the sort of systems paper that belongs in a robotics venue focused on tools and datasets rather than new algorithms. It deserves peer review because the code exists, the problem is real, and the integration is clean, even though the authors will likely need to add validation experiments and broader testing before acceptance. I would bring it to a reading group for the code and interfaces but would not cite it unless I were actually using the generated data.

Referee Report

1 major / 2 minor

Summary. The manuscript presents CLAW, an open-source pipeline for scalable generation of language-annotated whole-body motion data for the Unitree G1 humanoid robot. CLAW composes motion primitives from a kinematic planner parameterized by movement, heading, speed, pelvis height, and duration; offers two browser-based interfaces (real-time keyboard mode and timeline-based sequence editor) for data collection; tracks the resulting references with a low-level controller in MuJoCo simulation to produce physically grounded trajectories; and employs a template-based engine to generate natural-language annotations at both segment and trajectory levels.

Significance. If the described components function as specified, CLAW supplies a practical, reproducible tool for creating large-scale motion-language datasets that are both diverse and physically feasible, directly addressing the high cost and limited diversity of motion-capture methods as well as the physical infeasibility of purely kinematic generative models. The public GitHub release, browser interfaces, and explicit parameterization of primitives constitute clear strengths for accessibility and extensibility in humanoid robot learning research.

major comments (1)

[Abstract and low-level controller description] The central claim that the low-level controller 'yields physically grounded trajectories' is not accompanied by any reported tracking success rates, error statistics, stability metrics, or failure modes across generated motions. This evidence is load-bearing for the physical-feasibility guarantee that distinguishes the pipeline from purely kinematic approaches.

minor comments (2)

[Motion primitive parameterization] The free parameters (movement, heading, speed, pelvis height, duration) are enumerated, but the manuscript does not specify their allowable ranges, sampling distributions, or composition constraints, which would aid reproducibility of the claimed scalability.
[Annotation engine] The template-based language generation is described at a high level; concrete examples of segment-level versus trajectory-level annotations and any post-processing for naturalness would clarify the diversity claim.
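To make the segment-level versus trajectory-level distinction concrete, the following is a minimal sketch of template-based annotation. The templates, the verb table, and the function names are hypothetical and are not CLAW's actual template set; the sketch only shows the general technique of filling text templates from the same parameters that define a motion segment.

```python
import random

# Hypothetical templates in two registers (imperative and descriptive).
# These are illustrative and are not CLAW's actual template set.
SEGMENT_TEMPLATES = [
    "{verb} {movement} for {duration:g} seconds.",
    "The robot {verb_3p} {movement} for {duration:g} seconds.",
]

# Assumed mapping from motion mode to (imperative, third-person) verb forms.
VERBS = {"walk": ("Walk", "walks"), "run": ("Run", "runs")}


def annotate_segment(mode, movement, duration, rng):
    """Fill one randomly chosen template with a segment's parameters."""
    verb, verb_3p = VERBS[mode]
    template = rng.choice(SEGMENT_TEMPLATES)
    # str.format ignores keyword arguments a template does not use.
    return template.format(
        verb=verb, verb_3p=verb_3p, movement=movement, duration=duration
    )


def annotate_trajectory(segments, rng):
    """Describe a whole trajectory by joining its segment captions in order."""
    captions = [annotate_segment(m, mv, d, rng) for m, mv, d in segments]
    return " ".join(captions)


rng = random.Random(0)  # fixed seed so the sampled wording is reproducible
segments = [("walk", "forward", 3.0), ("walk", "backward", 1.5)]
print(annotate_segment(*segments[0], rng))
print(annotate_trajectory(segments, rng))
```

Diversity in a scheme like this comes only from the number of templates and the random choice among them, which is why the referee's request for concrete examples bears on the diversity claim.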

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback and positive recommendation for minor revision. We appreciate the recognition of CLAW's value as a practical tool for generating diverse, physically feasible motion-language data. We address the major comment below.


Referee: The central claim that the low-level controller 'yields physically grounded trajectories' is not accompanied by any reported tracking success rates, error statistics, stability metrics, or failure modes across generated motions. This evidence is load-bearing for the physical-feasibility guarantee that distinguishes the pipeline from purely kinematic approaches.

Authors: We agree that quantitative tracking metrics would strengthen the manuscript's distinction from purely kinematic methods. The current description relies on the use of MuJoCo's physics engine for reference tracking to ensure physical feasibility, but we acknowledge the absence of explicit success rates, error statistics, or failure mode analysis in the abstract and low-level controller section. In the revised version, we will add a dedicated paragraph (or short subsection) under the low-level controller description that reports aggregate tracking performance across the generated dataset, including average joint-position and pelvis-tracking RMSE, the fraction of trajectories that complete without falling or excessive deviation, and a brief discussion of common failure modes (e.g., when commanded speeds exceed actuator limits). These additions will be supported by example plots or tables and will directly address the load-bearing nature of the physical-grounding claim. revision: yes
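The aggregate metrics named in this response could be computed along the following lines. This is a sketch under assumptions: the function names, the scalar-sequence representation of a tracked quantity, and the `max_rmse` acceptance threshold are all hypothetical, since the manuscript reports no such procedure. The error measure is the usual root-mean-square error, $\mathrm{RMSE} = \sqrt{\tfrac{1}{T}\sum_{t=1}^{T}(r_t - x_t)^2}$, for a reference $r$ and a tracked signal $x$ of length $T$.

```python
import math


def rmse(reference, tracked):
    """Root-mean-square error between two equal-length sequences of scalars."""
    if len(reference) != len(tracked) or not reference:
        raise ValueError("sequences must be non-empty and of equal length")
    squared = sum((r - x) ** 2 for r, x in zip(reference, tracked))
    return math.sqrt(squared / len(reference))


def completion_fraction(per_trajectory_rmse, fell, max_rmse):
    """Share of trajectories that neither fell nor exceeded max_rmse.

    `fell` holds one boolean per trajectory; `max_rmse` is an assumed
    acceptance threshold for "excessive deviation".
    """
    if not per_trajectory_rmse:
        raise ValueError("no trajectories to evaluate")
    ok = sum(
        1 for err, f in zip(per_trajectory_rmse, fell) if not f and err <= max_rmse
    )
    return ok / len(per_trajectory_rmse)
```

Reporting both numbers together matters because a low average RMSE over completed rollouts can hide a high rate of falls.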

Circularity Check

0 steps flagged

No significant circularity


The paper describes an engineering pipeline for composing kinematic motion primitives, tracking them in MuJoCo simulation, and generating template-based language annotations. No mathematical derivations, predictions, fitted parameters, or first-principles results are claimed. The contribution is the release of the system itself, whose validity is directly testable via the linked repository rather than reduced to any self-referential quantity or self-citation chain. No load-bearing steps match the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

1 free parameters · 2 axioms · 0 invented entities

The pipeline rests on the domain assumption that MuJoCo provides sufficiently accurate physics for the G1 to produce 'physically grounded' trajectories when tracking kinematic references, and that template-based generation yields diverse natural language without additional learned models.

free parameters (1)

movement, heading, speed, pelvis height, duration
These are user-specified parameters for each primitive; they are design inputs rather than fitted constants but directly determine the generated motions.
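Because these parameters are inputs rather than fitted values, batch generation amounts to choosing them per segment. The sketch below shows seeded random sampling as one possible approach. The ranges, the movement labels, and the function names are placeholders: the manuscript does not report allowable ranges or a sampling scheme, so every number here is an assumption.

```python
import random

# Placeholder ranges: the manuscript does not report allowable ranges,
# so these bounds are assumptions chosen only for illustration.
PARAM_RANGES = {
    "heading_deg": (-180.0, 180.0),
    "speed_mps": (0.0, 1.0),
    "pelvis_height_m": (0.4, 0.8),
    "duration_s": (1.0, 5.0),
}
MOVEMENTS = ["forward", "backward", "left", "right"]  # assumed labels


def sample_primitive(rng):
    """Draw one parameter set uniformly from the placeholder ranges."""
    params = {name: rng.uniform(lo, hi) for name, (lo, hi) in PARAM_RANGES.items()}
    params["movement"] = rng.choice(MOVEMENTS)
    return params


def sample_sequence(n_segments, seed=0):
    """Sample a reproducible sequence of primitives from a fixed seed."""
    rng = random.Random(seed)
    return [sample_primitive(rng) for _ in range(n_segments)]
```

Fixing the seed makes a generated batch reproducible, which is the property the referee's comment on ranges and sampling distributions is asking the authors to document.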

axioms (2)

domain assumption: MuJoCo simulation models the dynamics and contact physics of the Unitree G1 humanoid accurately enough for reference tracking to produce feasible motions.
Invoked when stating that the low-level controller yields physically grounded trajectories.
domain assumption: Template-based language generation produces diverse and natural annotations at segment and trajectory levels.
Central to the claim of scalable language-annotated data.

pith-pipeline@v0.9.0 · 5474 in / 1643 out tokens · 33336 ms · 2026-05-10T15:12:10.823214+00:00 · methodology


Reference graph

Works this paper leans on

7 extracted references · 7 canonical work pages · 1 internal anchor

[1] Z. Luo, J. Cao, A. Weng, K. Kitani, and W. Xu, “Perpetual humanoid control for real-time simulated avatars,” in International Conference on Computer Vision (ICCV), 2023.

[2] X. B. Peng, P. Abbeel, S. Levine, and M. van de Panne, “DeepMimic: Example-guided deep reinforcement learning of physics-based character skills,” ACM Transactions on Graphics (Proc. SIGGRAPH), vol. 37, no. 4, 2018.

[3] J. P. Araujo, Y. Ze, P. Xu, J. Wu, and C. K. Liu, “Retargeting matters: General motion retargeting for humanoid motion tracking,”

[4] Karen Liu. [Online]. Available: https://arxiv.org/abs/2510.02252

[5] N. Mahmood, N. Ghorbani, N. F. Troje, G. Pons-Moll, and M. J. Black, “AMASS: Archive of motion capture as surface shapes,” in International Conference on Computer Vision (ICCV), 2019.

[6] D. Rempe, M. Petrovich, Y. Yuan, H. Zhang, X. B. Peng, Y. Jiang, T. Wang, U. Iqbal, D. Minor, M. de Ruyter, J. Li, C. Tessler, E. Lim, E. Jeong, S. Wu, E. Hassani, M. Huang, J.-B. Yu, C. Chung, L. Song, O. Dionne, J. Kautz, S. Yuen, and S. Fidler, “Kimodo: Scaling controllable human motion generation,” arXiv:2603.15546, 2026.

[7] Z. Luo, Y. Yuan, T. Wang, C. Li, S. Chen, F. Castañeda, Z.-A. Cao, J. Li, D. Minor, Q. Ben, X. Da, R. Ding, C. Hogg, L. Song, E. Lim, E. Jeong, T. He, H. Xue, W. Xiao, Z. Wang, S. Yuen, J. Kautz, Y. Chang, U. Iqbal, L. J. Fan, and Y. Zhu, “Sonic: Supersizing motion tracking for natural humanoid whole-body control,” arXiv preprint arXiv:2511.07820, 2025. [Online]. Available: https://...
