
arxiv: 2604.11251 · v3 · submitted 2026-04-13 · 💻 cs.RO

CLAW: Composable Language-Annotated Whole-body Motion Generation

Pith reviewed 2026-05-10 15:12 UTC · model grok-4.3

classification 💻 cs.RO
keywords whole-body motion generation · language-conditioned control · humanoid robots · motion-language datasets · physics simulation · kinematic planning · dataset generation · template annotation

The pith

CLAW generates large-scale language-annotated whole-body motion data for humanoid robots by composing kinematic primitives and simulating physical trajectories.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces CLAW as a pipeline to produce extensive paired datasets of whole-body motions and natural language descriptions needed for training language-conditioned controllers on humanoid robots such as the Unitree G1. It addresses limitations of motion capture by using composable primitives from a kinematic planner and browser interfaces for collection, then applies a low-level controller in MuJoCo simulation to create physically realistic trajectories. A template engine simultaneously generates diverse language annotations for individual segments and full sequences. This combination aims to deliver scalable, grounded data without the expense or physical infeasibility of prior approaches. A sympathetic reader would care because such datasets could enable more effective learning of language-guided humanoid behaviors at lower cost.

Core claim

CLAW is a pipeline for scalable generation of language-annotated whole-body motion data for the Unitree G1 humanoid robot. It composes motion primitives from a kinematic planner parameterized by movement, heading, speed, pelvis height, and duration, provides browser-based keyboard and timeline interfaces for data collection, tracks the references with a low-level controller in MuJoCo simulation to produce physically grounded trajectories, and uses a template-based engine to generate diverse natural-language annotations at both segment and trajectory levels.
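
As a rough illustration of the parameterization described above, a composed sequence can be modeled as a list of segments, each carrying the parameters the paper names. This is a minimal sketch, not the CLAW implementation: the class name `MotionPrimitive`, the field names, the units, and the example values are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class MotionPrimitive:
    """One segment of a composed sequence (names and units are assumed)."""
    mode: str               # motion mode, e.g. "walk" or "squat"
    movement: str           # movement direction, e.g. "forward"
    heading_deg: float      # commanded heading
    speed_mps: float        # commanded speed
    pelvis_height_m: float  # commanded pelvis height
    duration_s: float       # how long the segment lasts

    def __post_init__(self) -> None:
        if self.duration_s <= 0:
            raise ValueError("duration_s must be positive")


def segment_start_times(sequence: List[MotionPrimitive]) -> List[float]:
    """Start time of each segment when segments run back to back."""
    starts, t = [], 0.0
    for seg in sequence:
        starts.append(t)
        t += seg.duration_s
    return starts


def total_duration(sequence: List[MotionPrimitive]) -> float:
    """Length of the whole composed trajectory in seconds."""
    return sum(seg.duration_s for seg in sequence)


# Illustrative values only; they are not taken from the paper.
sequence = [
    MotionPrimitive("walk", "forward", 0.0, 0.5, 0.75, 3.0),
    MotionPrimitive("squat", "stationary", 0.0, 0.0, 0.5, 2.0),
    MotionPrimitive("walk", "forward", 90.0, 0.8, 0.75, 4.0),
]
```

Keeping each segment immutable means the same record can be handed to both the motion side and the annotation side without the two drifting apart, which is the property the pipeline's pairing of motion and language depends on.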

What carries the argument

The CLAW pipeline that composes parameterized kinematic motion primitives, tracks them via low-level control in physics simulation for physical grounding, and pairs the results with template-generated language annotations.
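
The template-based pairing mentioned above can be sketched in a few lines. The example below is hypothetical: the two style names, the template strings, the verb table, and the ", then " connective are assumptions chosen for illustration, and the paper's actual templates are not reproduced here. It shows one way segment-level text can be produced from motion parameters and then joined into a trajectory-level description.

```python
from typing import Dict, List, Tuple

# Hypothetical templates, one per annotation style.
SEGMENT_TEMPLATES: Dict[str, str] = {
    "instruction": "{verb} {movement} for {duration:g} seconds",
    "description": "the robot {verb_3p} {movement} for {duration:g} seconds",
}

# Hypothetical verb table: (imperative form, third-person form) per motion mode.
VERBS: Dict[str, Tuple[str, str]] = {
    "walk": ("walk", "walks"),
    "crawl": ("crawl", "crawls"),
}


def annotate_segment(seg: dict, style: str) -> str:
    """Fill one template from a segment's motion parameters."""
    base, third = VERBS[seg["mode"]]
    return SEGMENT_TEMPLATES[style].format(
        verb=base,
        verb_3p=third,
        movement=seg["movement"],
        duration=seg["duration_s"],
    )


def annotate_trajectory(segments: List[dict], style: str) -> str:
    """Join segment-level annotations into one trajectory-level sentence."""
    text = ", then ".join(annotate_segment(s, style) for s in segments)
    return text[0].upper() + text[1:] + "."


segments = [
    {"mode": "walk", "movement": "forward", "duration_s": 3.0},
    {"mode": "crawl", "movement": "backward", "duration_s": 2.0},
]
print(annotate_trajectory(segments, "instruction"))
# Walk forward for 3 seconds, then crawl backward for 2 seconds.
```

Because the text is derived from the same parameters that drive the motion, the annotation cannot disagree with the commanded segment, although it can still disagree with what the simulated robot actually did if tracking fails.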

If this is right

  • The system can produce motion-language pairs at scale without requiring motion-capture hardware.
  • The resulting data supports training of language-conditioned whole-body controllers for humanoid robots.
  • Browser interfaces enable both interactive exploration and batch generation of trajectories.
  • Physics simulation ensures trajectories respect stability and contact constraints before annotation.
  • Public release of the pipeline allows community extension for additional robot platforms or tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Adapting the planner and controller could extend the approach to other humanoid hardware with minimal redesign.
  • Replacing or augmenting templates with learned language models might increase annotation variety beyond fixed structures.
  • The generated pairs could serve as pretraining data for models that map language directly to low-level robot commands.

Load-bearing premise

The low-level controller in MuJoCo will successfully track the kinematic references to produce stable, physically feasible trajectories without falling or violating contact constraints, while the template engine will generate sufficiently diverse and natural language annotations at scale.

What would settle it

A high frequency of simulation rollouts in which the robot falls, violates joint limits, or loses contact stability would show that the generated trajectories are not physically grounded as claimed.
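
One simple way to measure such a frequency is sketched below. It covers only falls, detected through a pelvis-height proxy; joint-limit and contact checks would need additional signals. The function names, the threshold, and the example rollouts are assumptions for illustration and are not taken from the paper or its repository.

```python
from typing import List, Sequence


def has_fallen(pelvis_heights_m: Sequence[float], min_height_m: float) -> bool:
    """Flag a rollout as a fall if the pelvis ever drops below the threshold."""
    return any(h < min_height_m for h in pelvis_heights_m)


def fall_rate(rollouts: List[Sequence[float]], min_height_m: float) -> float:
    """Fraction of rollouts flagged as falls."""
    if not rollouts:
        raise ValueError("need at least one rollout")
    falls = sum(has_fallen(r, min_height_m) for r in rollouts)
    return falls / len(rollouts)


# Illustrative pelvis-height traces in metres, one list per rollout.
rollouts = [
    [0.75, 0.74, 0.73],
    [0.75, 0.40, 0.15],  # drops below the threshold, counted as a fall
    [0.70, 0.72, 0.71],
    [0.74, 0.73, 0.72],
]
print(fall_rate(rollouts, min_height_m=0.3))  # 0.25
```

A single global threshold would misclassify motion modes that intentionally lower the pelvis, such as kneeling or crawling, so in practice the threshold would likely need to be set per mode.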

Figures

Figures reproduced from arXiv: 2604.11251 by Jianuo Cao, Masayoshi Tomizuka, Yuxin Chen.

Figure 1: Overview of CLAW. Users can compose whole-body motion sequences for the Unitree G1 humanoid through either keyboard mode (top-left) or editor mode (top-right), producing diverse trajectories with seamless transitions between motion modes (bottom-left). Each trajectory is automatically paired with multi-style natural-language descriptions (bottom-right).
Figure 2: Overview of the CLAW data generation pipeline. A human operator specifies motion intent via a keyboard controller or sequence editor. Structured commands are streamed to a kinematic planner whose reference motions are tracked by a whole-body controller in MuJoCo simulation. A template-based annotation engine simultaneously produces diverse natural-language descriptions from the same motion parameters, yiel…
Figure 3: Two complementary control interfaces of CLAW. (a) In keyboard mode, the operator controls the robot interactively using key bindings for motion mode, movement, heading, speed, pelvis height, and duration. (b) In editor mode, motion clips are arranged on a visual timeline with per-segment configuration, enabling reproducible large-batch data generation.
Figure 4: Illustrative examples of CLAW generation capabilities. (a) The pipeline produces diverse whole-body motions spanning locomotion, squatting, boxing, and styled walking. (b) For a given motion mode, parameters such as velocity and heading can be continuously adjusted. (c) Each generated motion sequence is automatically annotated with multiple stylistically varied natural-language descriptions.
Figure 5: System architecture of CLAW. The pipeline comprises four decoupled processes: a MuJoCo simulation, a C++ controller hosting the kinematic planner and tracking policy, a WebSocket–ZMQ bridge for protocol translation and recipe orchestration, and a browser-based frontend for visualization and operator interaction.
Figure 6: Motion stitching. CLAW enables motion stitching across semantically distinct motion modes. (a) The generative planner produces smooth, natural transitions between different motion modes. (b) Abrupt mode switching might lead to transient artifacts in the generated motion.
Figure 7: Multi-stage trajectory. Example of automated language annotation for a long-horizon trajectory with multiple stages.
original abstract

Training language-conditioned whole-body controllers for humanoid robots demands large-scale motion-language datasets. Existing approaches based on motion capture are costly and limited in diversity, while text-to-motion generative models produce purely kinematic outputs that are not guaranteed to be physically feasible. We present CLAW, a pipeline for scalable generation of language-annotated whole-body motion data for the Unitree G1 humanoid robot. CLAW composes motion primitives from a kinematic planner, parameterized by movement, heading, speed, pelvis height, and duration, and provides two browser-based interfaces--a real-time keyboard mode and a timeline-based sequence editor--for exploratory and batch data collection. A low-level controller tracks these references in MuJoCo simulation, yielding physically grounded trajectories. In parallel, a template-based engine generates diverse natural-language annotations at both segment and trajectory levels. To support scalable generation of motion-language paired data for humanoid robot learning, we make our system publicly available at: https://github.com/JianuoCao/CLAW

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript presents CLAW, an open-source pipeline for scalable generation of language-annotated whole-body motion data for the Unitree G1 humanoid robot. CLAW composes motion primitives from a kinematic planner parameterized by movement, heading, speed, pelvis height, and duration; offers two browser-based interfaces (real-time keyboard mode and timeline-based sequence editor) for data collection; tracks the resulting references with a low-level controller in MuJoCo simulation to produce physically grounded trajectories; and employs a template-based engine to generate natural-language annotations at both segment and trajectory levels.

Significance. If the described components function as specified, CLAW supplies a practical, reproducible tool for creating large-scale motion-language datasets that are both diverse and physically feasible, directly addressing the high cost and limited diversity of motion-capture methods as well as the physical infeasibility of purely kinematic generative models. The public GitHub release, browser interfaces, and explicit parameterization of primitives constitute clear strengths for accessibility and extensibility in humanoid robot learning research.

major comments (1)
  1. [Abstract and low-level controller description] The central claim that the low-level controller 'yields physically grounded trajectories' is not accompanied by any reported tracking success rates, error statistics, stability metrics, or failure modes across generated motions. This evidence is load-bearing for the physical-feasibility guarantee that distinguishes the pipeline from purely kinematic approaches.
minor comments (2)
  1. [Motion primitive parameterization] The free parameters (movement, heading, speed, pelvis height, duration) are enumerated, but the manuscript does not specify their allowable ranges, sampling distributions, or composition constraints, which would aid reproducibility of the claimed scalability.
  2. [Annotation engine] The template-based language generation is described at a high level; concrete examples of segment-level versus trajectory-level annotations and any post-processing for naturalness would clarify the diversity claim.
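
To make the first minor comment concrete, the sketch below shows what an explicit range specification with seeded sampling could look like. Every range here is a placeholder invented for illustration, since the manuscript does not report the actual limits; the names `PARAM_RANGES`, `sample_primitive`, and `sample_batch` are likewise hypothetical.

```python
import random
from typing import Dict, List, Tuple

# Placeholder ranges for illustration only; not values from the manuscript.
PARAM_RANGES: Dict[str, Tuple[float, float]] = {
    "heading_deg": (-180.0, 180.0),
    "speed_mps": (0.0, 1.0),
    "pelvis_height_m": (0.4, 0.8),
    "duration_s": (1.0, 5.0),
}


def sample_primitive(rng: random.Random) -> Dict[str, float]:
    """Draw each parameter uniformly from its declared range."""
    return {name: rng.uniform(lo, hi) for name, (lo, hi) in PARAM_RANGES.items()}


def sample_batch(n: int, seed: int) -> List[Dict[str, float]]:
    """Draw n parameter sets; a fixed seed makes the batch reproducible."""
    rng = random.Random(seed)
    return [sample_primitive(rng) for _ in range(n)]


batch = sample_batch(5, seed=0)
```

Publishing a table of this kind alongside the seed would let other groups regenerate the same batch, which is the reproducibility gap the comment points at. Uniform sampling is only one choice; the paper does not say which distribution, if any, is used.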

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback and positive recommendation for minor revision. We appreciate the recognition of CLAW's value as a practical tool for generating diverse, physically feasible motion-language data. We address the major comment below.

point-by-point responses
  1. Referee: The central claim that the low-level controller 'yields physically grounded trajectories' is not accompanied by any reported tracking success rates, error statistics, stability metrics, or failure modes across generated motions. This evidence is load-bearing for the physical-feasibility guarantee that distinguishes the pipeline from purely kinematic approaches.

    Authors: We agree that quantitative tracking metrics would strengthen the manuscript's distinction from purely kinematic methods. The current description relies on the use of MuJoCo's physics engine for reference tracking to ensure physical feasibility, but we acknowledge the absence of explicit success rates, error statistics, or failure mode analysis in the abstract and low-level controller section. In the revised version, we will add a dedicated paragraph (or short subsection) under the low-level controller description that reports aggregate tracking performance across the generated dataset, including average joint-position and pelvis-tracking RMSE, the fraction of trajectories that complete without falling or excessive deviation, and a brief discussion of common failure modes (e.g., when commanded speeds exceed actuator limits). These additions will be supported by example plots or tables and will directly address the load-bearing nature of the physical-grounding claim.

    revision: yes
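
The joint-position RMSE promised in this response could be computed along the following lines. This is a sketch under stated assumptions: the function name `joint_rmse`, the nested-list input layout (one list of joint positions per timestep, in radians), and the example numbers are hypothetical, and the authors' eventual metric may be defined differently.

```python
import math
from typing import Sequence


def joint_rmse(
    reference: Sequence[Sequence[float]],
    tracked: Sequence[Sequence[float]],
) -> float:
    """Root-mean-square error over all timesteps and joints."""
    if len(reference) != len(tracked):
        raise ValueError("trajectories must have the same number of timesteps")
    sq_sum, count = 0.0, 0
    for ref_t, trk_t in zip(reference, tracked):
        if len(ref_t) != len(trk_t):
            raise ValueError("each timestep must have the same number of joints")
        for r, q in zip(ref_t, trk_t):
            sq_sum += (r - q) ** 2
            count += 1
    if count == 0:
        raise ValueError("trajectories are empty")
    return math.sqrt(sq_sum / count)


# Two timesteps, two joints; a single joint is off by 1.0 rad at the first step.
reference = [[0.0, 0.0], [0.0, 0.0]]
tracked = [[1.0, 0.0], [0.0, 0.0]]
print(joint_rmse(reference, tracked))  # 0.5
```

Averaging over all joints hides which joint is responsible for an error, so a per-joint breakdown may be more informative when diagnosing the failure modes the response mentions.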

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper describes an engineering pipeline for composing kinematic motion primitives, tracking them in MuJoCo simulation, and generating template-based language annotations. No mathematical derivations, predictions, fitted parameters, or first-principles results are claimed. The contribution is the release of the system itself, whose validity is directly testable via the linked repository rather than reduced to any self-referential quantity or self-citation chain. No load-bearing steps match the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 0 invented entities

The pipeline rests on the domain assumption that MuJoCo provides sufficiently accurate physics for the G1 to produce 'physically grounded' trajectories when tracking kinematic references, and that template-based generation yields diverse natural language without additional learned models.

free parameters (1)
  • movement, heading, speed, pelvis height, duration
    These are user-specified parameters for each primitive; they are design inputs rather than fitted constants but directly determine the generated motions.
axioms (2)
  • domain assumption: MuJoCo simulation models the dynamics and contact physics of the Unitree G1 humanoid accurately enough for reference tracking to produce feasible motions.
    Invoked when stating that the low-level controller yields physically grounded trajectories.
  • domain assumption: Template-based language generation produces diverse and natural annotations at segment and trajectory levels.
    Central to the claim of scalable language-annotated data.

pith-pipeline@v0.9.0 · 5474 in / 1643 out tokens · 33336 ms · 2026-05-10T15:12:10.823214+00:00 · methodology

