pith. machine review for the scientific record.

arxiv: 2605.14417 · v1 · submitted 2026-05-14 · 💻 cs.RO · cs.CV

Recognition: 2 theorem links · Lean Theorem

Before the Body Moves: Learning Anticipatory Joint Intent for Language-Conditioned Humanoid Control

Authors on Pith · no claims yet

Pith reviewed 2026-05-15 02:11 UTC · model grok-4.3

classification 💻 cs.RO · cs.CV
keywords language-conditioned humanoid control · anticipatory joint intent · diffusion action policy · streaming whole-body control · dynamics alignment · teacher-student distillation · future intent generation

The pith

DAJI learns dynamics-aligned joint intents so language commands produce humanoid actions that are both immediate and anticipatory of future contacts and balance shifts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to replace reactive repair in language-driven humanoid control with an explicit anticipatory interface. It claims that a hierarchical structure can generate joint intents that encode upcoming physical transitions while remaining executable in closed loop. The approach uses a teacher-student distillation step to transfer future awareness into a deployable diffusion policy and an autoregressive flow model to produce future intent chunks from language and history. A reader would care because current kinematic or latent methods leave the low-level controller to handle transitions after they arise, which limits fluid streaming performance. If the claim holds, language becomes a more reliable interface for whole-body tasks that require preparation before movement begins.

Core claim

DAJI is a hierarchical framework whose DAJI-Act component distills a future-aware teacher policy into a deployable diffusion action policy through student-driven rollouts, while its DAJI-Flow component autoregressively generates future intent chunks conditioned on language and prior intent. This joint-intent layer sits between language generation and closed-loop control and explicitly encodes upcoming contact changes, support transfers, and balance preparation. Experiments report 94.42% rollout success on HumanML3D-style generation tasks and 0.152 subsequence FID on BABEL streaming benchmarks.
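To make the claimed interface concrete, here is a minimal deployment-loop sketch of the hierarchy as the claim describes it. This is an editorial illustration, not the authors' code: every name (DAJIFlow, DAJIAct, env), the chunk length of 8, and the latent and joint dimensions are hypothetical stand-ins.

```python
import numpy as np

class DAJIFlow:
    """Hypothetical stand-in: (language, intent history) -> next joint-intent chunk."""
    def next_chunk(self, language, history):
        # A real model would autoregressively sample a latent chunk encoding
        # upcoming contact changes, support transfers, and balance preparation.
        return np.zeros((8, 32))  # assumed: 8 future steps x 32-dim intent latent

class DAJIAct:
    """Hypothetical stand-in: diffusion policy decoding one intent with live proprioception."""
    def act(self, intent, proprioception):
        # A real policy would run a few denoising steps conditioned on both inputs.
        return np.zeros(29)  # assumed: 29 actuated joints

def deploy(flow, act, language, env, horizon=240):
    """Closed loop: Flow plans ahead in intent space, Act stays executable now."""
    history, obs = [], env.reset()
    for t in range(horizon):
        if t % 8 == 0:  # stream one intent chunk at a time
            chunk = flow.next_chunk(language, history)
            history.append(chunk)
        obs = env.step(act.act(chunk[t % 8], obs))
```

The point of the sketch is the division of labor: the chunked Flow output is where anticipation lives, while Act never has to plan, only decode.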

What carries the argument

Dynamics-Aligned Joint Intent (DAJI), a hierarchical anticipatory interface that encodes future contact and balance information between language inputs and low-level control outputs.

If this is right

  • Enables language-conditioned generation of whole-body actions that prepare for upcoming contacts rather than repairing after the fact.
  • Supports streaming instruction following at 0.152 subsequence FID while maintaining 94.42% rollout success on standard benchmarks.
  • Provides an explicit latent representation of future joint states that low-level trackers can use without reactive correction.
  • Allows single-instruction and multi-turn language commands to produce coherent sequences of anticipatory intents.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same distillation-plus-autoregressive structure could be tested on non-humanoid platforms where anticipatory contact planning matters, such as wheeled or legged mobile manipulators.
  • If the joint-intent layer proves stable, it could reduce reliance on separate kinematic reference generators that later require low-level repair.
  • Extending the intent chunks to include visual or tactile predictions would be a direct next measurement of whether the anticipatory signal generalizes beyond motion only.

Load-bearing premise

Distilling a future-aware teacher policy into a deployable diffusion action policy via student-driven rollouts preserves the anticipatory properties without substantial degradation in closed-loop performance.
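The premise is easiest to see as a loop. Below is a minimal sketch of student-driven distillation, assuming the standard DAgger-style pattern that "student-driven rollouts" usually names; the paper does not publish this code, and the teacher, student, and env interfaces are hypothetical.

```python
def distill(teacher, student, env, iterations=10, rollout_len=500):
    """Student actions drive the rollout; the privileged, future-aware teacher
    relabels every visited state, so the student trains on its own distribution."""
    dataset = []  # aggregated (observation, teacher action) pairs
    for _ in range(iterations):
        obs = env.reset()
        for _ in range(rollout_len):
            teacher_action = teacher.act(env.privileged_state())  # future-aware label
            dataset.append((obs, teacher_action))
            obs = env.step(student.act(obs))  # crucially, the *student* acts
        student.fit(dataset)  # e.g. a diffusion denoising loss on teacher actions
    return student
```

The premise, then, is that matching the teacher on these self-visited states is enough to carry anticipation across, without any explicit future-prediction loss.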

What would settle it

A closed-loop evaluation that shows substantially lower rollout success or markedly higher subsequence FID when the distilled student policy is used instead of the teacher would indicate that the distillation step fails to retain anticipatory capability.
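Operationally, the settling experiment is a paired comparison. A sketch under assumed interfaces (env.rollout returns a motion clip plus a success flag; fid_fn is any subsequence-FID implementation; none of these names come from the paper):

```python
def evaluate(policy, tasks, env):
    """Success rate and generated clips for one policy over a fixed task suite."""
    clips, successes = [], []
    for task in tasks:
        clip, success = env.rollout(policy, task)
        clips.append(clip)
        successes.append(success)
    return sum(successes) / len(successes), clips

def teacher_student_gap(teacher, student, tasks, env, fid_fn):
    """Large positive gaps would indicate distillation failed to retain anticipation."""
    teacher_success, teacher_clips = evaluate(teacher, tasks, env)
    student_success, student_clips = evaluate(student, tasks, env)
    return {"success_gap": teacher_success - student_success,
            "fid_gap": fid_fn(student_clips) - fid_fn(teacher_clips)}
```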

Figures

Figures reproduced from arXiv: 2605.14417 by Haozhe Jia, Honglei Jin, Jianfei Song, Kuimou Yu, Lei Wang, Shaofeng Liang, Shuxu Jin, Wenshuo Chen, Youcheng Fan, Yuan Zhang, Yutao Yue, Zinuo Zhang.

Figure 1
Figure 1: Teaser of DAJI. Instead of using reference trajectories as the deployment interface, DAJI predicts executable and anticipatory joint-intent latents that improve future prediction and long-horizon humanoid control.
Figure 2
Figure 2: DAJI framework. DAJI separates online deployment and offline training. DAJI-Flow predicts joint-intent latents from language and latent history, while DAJI-Act decodes each intent with live proprioception. DAJI-Act learns the executable joint-intent interface through student-driven in-the-loop distillation from a future-aware privileged teacher.
Figure 3
Figure 3: Tracker-level rollout visualization in simulation. DAJI decodes generated joint-intent latents into continuous whole-body humanoid motions, including dynamic and highly articulated behaviors.
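For reference, the DAJI-Act training objective that appears alongside this figure in the arXiv source is a clean-action denoising loss. A hedged LaTeX reconstruction: the expectation subscripts follow the source text, while the closing squared norm is an assumption, since only the prefix survives extraction.

```latex
\mathcal{L}_{\mathrm{Act}}
  = \mathbb{E}_{(\mathbf{o}_t,\,\mathbf{a}^{\mathrm{tea}}_t)\sim\mathcal{D}_{\mathrm{student}},\;\tau,\;\boldsymbol{\epsilon}_a,\;\boldsymbol{\epsilon}_z}
    \Big[\,\big\lVert \mathbf{a}^{\mathrm{tea}}_t
      - \mathcal{D}_\phi\!\left(\mathbf{x}_\tau,\,\tau,\,\mathbf{c}_t\right) \big\rVert^{2}\,\Big]
```

Here, per the source, $\bar{\alpha}_\tau$ is the cumulative noise schedule and $\mathcal{D}_\phi$ is a lightweight denoiser that predicts the clean action conditioned on $\mathbf{c}_t$.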
Figure 4
Figure 4: Qualitative deployment results on physical humanoid hardware. DAJI produces executable motions under both streaming instruction switches and single-instruction generation. Any object-related phrases shown in qualitative prompts are interpreted only as body-motion descriptions; no object state or manipulation outcome is modeled or evaluated.
Figure 5
Figure 5: Simulation (MuJoCo) validation results: robust tracking performance from simple gestures to complex maneuvers. Any prompt phrase is interpreted only as a body-motion description; no object state or manipulation outcome is modeled or evaluated.
read the original abstract

Natural language is an intuitive interface for humanoid robots, yet streaming whole-body control requires control representations that are executable now and anticipatory of future physical transitions. Existing language-conditioned humanoid systems typically generate kinematic references that a low-level tracker must repair reactively, or use latent/action policies whose outputs do not explicitly encode upcoming contact changes, support transfers, and balance preparation. We propose DAJI (Dynamics-Aligned Joint Intent), a hierarchical framework that learns an anticipatory joint-intent interface between language generation and closed-loop control. DAJI-Act distills a future-aware teacher into a deployable diffusion action policy through student-driven rollouts, while DAJI-Flow autoregressively generates future intent chunks from language and intent history. Experiments show that DAJI achieves strong results in anticipatory latent learning, single-instruction generation, and streaming instruction following, reaching 94.42% rollout success on HumanML3D-style generation and 0.152 subsequence FID on BABEL.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript proposes DAJI (Dynamics-Aligned Joint Intent), a hierarchical framework for language-conditioned humanoid control. DAJI-Act distills a future-aware teacher policy into a deployable diffusion action policy via student-driven rollouts, while DAJI-Flow autoregressively generates future intent chunks from language and intent history. Experiments claim strong performance in anticipatory latent learning, single-instruction generation, and streaming instruction following, with 94.42% rollout success on HumanML3D-style tasks and 0.152 subsequence FID on BABEL.

Significance. If the distillation successfully transfers anticipatory properties to the closed-loop diffusion policy, the work could advance language-conditioned whole-body control by enabling proactive handling of contact changes, support transfers, and balance preparation rather than relying on reactive low-level repairs. The separation of future-intent generation from executable action diffusion offers a scalable interface for streaming humanoid tasks.

major comments (1)
  1. [§3.2] DAJI-Act distillation procedure: the student-driven rollout objective contains no explicit future-horizon loss, teacher-student anticipation alignment term, or future-state prediction error regularizer. This is load-bearing for the central claim that the deployable diffusion policy encodes upcoming contact changes and balance preparation, because without such a term the policy can achieve high rollout success while collapsing to reactive behavior. Please add a quantitative alignment metric between teacher and student future predictions and report its value in the results; one candidate form is sketched below.
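One hedged candidate for the requested term, stated editorially rather than drawn from the paper: an explicit teacher-student alignment loss over an $H$-step future horizon, where $\hat{\mathbf{z}}$ denotes predicted joint intents (the symbols here are this report's, not the authors').

```latex
\mathcal{L}_{\mathrm{align}}
  = \mathbb{E}\Big[\frac{1}{H}\sum_{h=1}^{H}
      \big\lVert \hat{\mathbf{z}}^{\,\mathrm{stu}}_{t+h}
        - \hat{\mathbf{z}}^{\,\mathrm{tea}}_{t+h} \big\rVert_2^{2}\Big]
```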
minor comments (2)
  1. [Abstract] Quantitative claims (94.42% success, 0.152 FID) are stated without naming the baselines, ablation variants, or error-bar statistics; a one-sentence reference to the comparison methods would improve readability.
  2. [§2] Notation: the distinction between 'joint intent' and 'action' chunks is introduced without a compact equation or diagram in the early sections; a small schematic would clarify the hierarchical interface (one possible form is sketched after this list).
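An editorial sketch of the schematic the comment asks for, assuming the factorization implied by the abstract: DAJI-Flow produces intent chunks $\mathbf{z}_k$ from language $\ell$, and DAJI-Act decodes actions $\mathbf{a}_t$ from the active chunk and proprioception $\mathbf{o}_t$ over the timesteps $\mathcal{T}_k$ that the chunk covers.

```latex
p(\mathbf{a}_{1:T},\, \mathbf{z}_{1:K} \mid \ell)
  = \prod_{k=1}^{K}
      \underbrace{p_{\mathrm{Flow}}\!\left(\mathbf{z}_k \mid \ell,\, \mathbf{z}_{<k}\right)}_{\text{anticipatory joint intents}}
      \prod_{t \in \mathcal{T}_k}
      \underbrace{p_{\mathrm{Act}}\!\left(\mathbf{a}_t \mid \mathbf{z}_k,\, \mathbf{o}_t\right)}_{\text{executable actions}}
```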

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their thoughtful review and constructive suggestion regarding the distillation procedure. We address the major comment below and outline the planned revision.

read point-by-point responses
  1. Referee: [§3.2] DAJI-Act distillation procedure: the student-driven rollout objective contains no explicit future-horizon loss, teacher-student anticipation alignment term, or future-state prediction error regularizer. This is load-bearing for the central claim that the deployable diffusion policy encodes upcoming contact changes and balance preparation, because without such a term the policy can achieve high rollout success while collapsing to reactive behavior. Please add a quantitative alignment metric between teacher and student future predictions and report its value in the results.

    Authors: We appreciate the referee's point that an explicit alignment term would more directly substantiate the transfer of anticipatory behavior. The current student-driven rollout objective matches the teacher policy's actions on trajectories that require proactive contact and balance decisions for success; this implicitly encourages the student to internalize future-aware behavior rather than purely reactive corrections. Nevertheless, we agree that a dedicated quantitative metric strengthens the central claim. In the revised manuscript we will introduce and report a teacher-student future-prediction alignment metric (mean L2 distance on predicted joint intents and contact flags over a 10-step horizon, computed on held-out rollouts) in the results section. revision: yes
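A sketch of the promised metric under stated assumptions: predictions stored as arrays of shape (N, 10, D) for joint-intent latents and boolean arrays of shape (N, 10, C) for contact flags over the 10-step horizon. The names and shapes are illustrative, not from the paper.

```python
import numpy as np

def future_alignment(z_student, z_teacher, contacts_student, contacts_teacher):
    """Teacher-student alignment on held-out rollouts: mean L2 distance between
    predicted joint-intent latents, plus disagreement rate on contact flags."""
    intent_l2 = np.linalg.norm(z_student - z_teacher, axis=-1).mean()
    contact_disagreement = (contacts_student != contacts_teacher).mean()
    return {"intent_l2": float(intent_l2),
            "contact_disagreement": float(contact_disagreement)}
```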

Circularity Check

0 steps flagged

No significant circularity; claims rest on empirical rollouts

full rationale

The paper presents DAJI as a hierarchical framework with DAJI-Act performing distillation via student-driven rollouts from a future-aware teacher and DAJI-Flow generating intent chunks autoregressively. Reported metrics (94.42% rollout success, 0.152 subsequence FID) are obtained from closed-loop evaluations on HumanML3D and BABEL benchmarks. No equations, definitions, or self-citations in the abstract or described method reduce these outcomes to fitted parameters or prior results by construction. The distillation step is presented as an empirical procedure without a uniqueness theorem or ansatz imported from self-citation that would force the anticipatory properties. This is the common case of an empirical robotics paper whose central claims remain falsifiable via external rollouts.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The framework rests on standard assumptions of teacher-student distillation in reinforcement learning and diffusion models; no explicit free parameters or invented entities are named in the abstract.

free parameters (1)
  • diffusion and flow model hyperparameters
    Typical learned or tuned parameters in the student policy and autoregressive generator.
axioms (1)
  • domain assumption: A future-aware teacher policy can be distilled into a real-time student without loss of anticipatory capability
    Central to the DAJI-Act component described in the abstract.

pith-pipeline@v0.9.0 · 5519 in / 1148 out tokens · 43757 ms · 2026-05-15T02:11:06.582991+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

48 extracted references · 48 canonical work pages · 8 internal anchors

  1. [1] J. Massion, "Movement, posture and equilibrium: Interaction and coordination", Progress in Neurobiology, vol. 38, no. 1, pp. 35–56, 1992.
  2. [2] S. Bouisset and M.-C. Do, "Posture, dynamic stability, and voluntary movement", Neurophysiologie Clinique/Clinical Neurophysiology, vol. 38, no. 6, pp. 345–362, 2008.
  3. [3] E. Todorov, T. Erez, and Y. Tassa, "MuJoCo: A physics engine for model-based control", in 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, IEEE, 2012, pp. 5026–5033. doi:10.1109/IROS.2012.6386109.
  4. [4] M. Plappert, C. Mandery, and T. Asfour, "The KIT motion-language dataset", Big Data, vol. 4, no. 4, pp. 236–252, 2016.
  5. [5] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, "Proximal policy optimization algorithms", CoRR, vol. abs/1707.06347, 2017. arXiv:1707.06347. Available: http://arxiv.org/abs/1707.06347.
  6. [6] X. B. Peng, P. Abbeel, S. Levine, and M. Van De Panne, "DeepMimic: Example-guided deep reinforcement learning of physics-based character skills", ACM Transactions on Graphics (TOG), vol. 37, no. 4, pp. 1–14, 2018.
  7. [7] J. Ho, A. Jain, and P. Abbeel, "Denoising diffusion probabilistic models", in Advances in Neural Information Processing Systems, vol. 33, Curran Associates, Inc., 2020, pp. 6840–6851. Available: https://proceedings.neurips.cc/paper/2020/file/4c5bcfec8584af0d967f1ab10179ca4b-Paper.pdf.
  8. [8] A. R. Punnakkal, A. Chandrasekaran, N. Athanasiou, A. Quiros-Ramirez, and M. J. Black, "BABEL: Bodies, action and behavior with English labels", in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Jun. 2021, pp. 722–731.
  9. [9] J. Song, C. Meng, and S. Ermon, "Denoising diffusion implicit models", in 9th International Conference on Learning Representations, ICLR 2021, OpenReview.net, 2021. Available: https://openreview.net/forum?id=St1giarCHLP.
  10. [10] M. Ahn et al., "Do as I can, not as I say: Grounding language in robotic affordances", arXiv preprint arXiv:2204.01691, 2022.
  11. [11] C. Guo et al., "Generating diverse and natural 3D human motions from text", in CVPR, 2022.
  12. [12] G. Tevet, S. Raab, B. Gordon, Y. Shafir, D. Cohen-Or, and A. H. Bermano, "Human motion diffusion model", arXiv preprint arXiv:2209.14916, 2022.
  13. [13] C. Guo, Y. Mu, M. G. Javed, S. Wang, and L. Cheng, "MoMask: Generative masked modeling of 3D human motions", arXiv preprint arXiv:2312.00063, 2023.
  14. [14] V. T. Hu et al., "Motion flow matching for human motion synthesis and editing", arXiv preprint arXiv:2312.08895, 2023.
  15. [15] W. Huang, C. Wang, R. Zhang, Y. Li, J. Wu, and L. Fei-Fei, "VoxPoser: Composable 3D value maps for robotic manipulation with language models", arXiv preprint arXiv:2307.05973, 2023.
  16. [16] J. Liang et al., "Code as policies: Language model programs for embodied control", in 2023 IEEE International Conference on Robotics and Automation (ICRA), IEEE, 2023, pp. 9493–9500.
  17. [17] Y. Lipman, R. T. Q. Chen, H. Ben-Hamu, M. Nickel, and M. Le, "Flow matching for generative modeling", in The Eleventh International Conference on Learning Representations, ICLR 2023, OpenReview.net, 2023. Available: https://openreview.net/forum?id=PqvMRDCJT9t.
  18. [18] Z. Luo, J. Cao, K. Kitani, W. Xu, et al., "Perpetual humanoid control for real-time simulated avatars", in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 10895–10904.
  19. [19] W. Peebles and S. Xie, "Scalable diffusion models with transformers", in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Oct. 2023, pp. 4195–4205.
  20. [20] C. Tessler, Y. Kasten, Y. Guo, S. Mannor, G. Chechik, and X. B. Peng, "CALM: Conditional adversarial latent models for directable virtual characters", in ACM SIGGRAPH 2023 Conference Proceedings, 2023, pp. 1–9.
  21. [21] J. Zhang et al., "T2M-GPT: Generating human motion from textual descriptions with discrete representations", in CVPR, 2023.
  22. [22] G. Barquero, S. Escalera, and C. Palmero, "Seamless human motion composition with blended positional encodings", in CVPR, 2024.
  23. [23] W. Chen et al., "SATO: Stable text-to-motion framework", in Proceedings of the 32nd ACM International Conference on Multimedia (MM '24), Melbourne, VIC, Australia: Association for Computing Machinery, 2024, pp. 6989–6997. doi:10.1145/3664647.3681034.
  24. [24] Z. Jiang, Y. Xie, J. Li, Y. Yuan, Y. Zhu, and Y. Zhu, "Harmon: Whole-body motion generation of humanoid robots from language descriptions", in Conference on Robot Learning, 2024. arXiv:2410.12773 [cs.RO].
  25. [25] S. Bai, Y. Cai, R. Chen, K. Chen, et al., "Qwen3-VL technical report", arXiv preprint arXiv:2511.21631, 2025. Available: https://arxiv.org/abs/2511.21631.
  26. [26] W. Chen et al., "ANT: Adaptive neural temporal-aware text-to-motion model", in Proceedings of the 33rd ACM International Conference on Multimedia (MM '25), ACM, Oct. 2025, pp. 9852–9861. doi:10.1145/3746027.3755168.
  27. [27] W. Chen et al., "Free-T2M: Robust text-to-motion generation for humanoid robots via frequency domain", 2025. arXiv:2501.18232 [cs.CV]. Available: https://arxiv.org/abs/2501.18232.
  28. [28] W. Chen et al., "Polaris: Projection-orthogonal least squares for robust and adaptive inversion in diffusion models", 2025. arXiv:2512.00369 [cs.CV]. Available: https://arxiv.org/abs/2512.00369.
  29. [29] H. Jia et al., "LUMA: Low-dimension unified motion alignment with dual-path anchoring for text-to-motion diffusion model", 2025. arXiv:2509.25304 [cs.CV].
  30. [30] arXiv:2509.25304 [cs.CV]. Available: https://arxiv.org/abs/2509.25304.
  31. [31] H. Jia et al., "Physics-informed representation alignment for sparse radio-map reconstruction", 2025. arXiv:2501.19160 [cs.CV]. Available: https://arxiv.org/abs/2501.19160.
  32. [32] H. Jia et al., "Physics-informed representation alignment for sparse radio-map reconstruction", in Proceedings of the 33rd ACM International Conference on Multimedia (MM '25), Dublin, Ireland: Association for Computing Machinery, 2025, pp. 12352–12360. doi:10.1145/3746027.3758161.
  33. [33] N. Jiang et al., "UniAct: Unified motion generation and action streaming for humanoid robots", 2025. arXiv:2512.24321 [cs.CV].
  34. [34] Z. Li et al., "From language to locomotion: Retargeting-free humanoid control via motion latent guidance", 2025. arXiv:2510.14952 [cs.RO].
  35. [35] Q. Lu, Y. Feng, B. Shi, M. Piseno, Z. Bao, and C. K. Liu, "GentleHumanoid: Learning upper-body compliance for contact-rich human and object interaction", 2025. arXiv:2511.04679 [cs.RO]. Available: https://arxiv.org/abs/2511.04679.
  36. [36] M. Ning et al., "DCTdiff: Intriguing properties of image generative modeling in the DCT space", 2025. arXiv:2412.15032 [cs.CV]. Available: https://arxiv.org/abs/2412.15032.
  37. [37] Y. Shao et al., "LangWBC: Language-directed humanoid whole-body control via end-to-end learning", 2025. arXiv:2504.21738 [cs.RO].
  38. [38] B. Tian, W. Chen, Z. Li, S. Lai, J. Wu, and Y. Yue, "Text2Weight: Bridging natural language and neural network weight spaces", in Proceedings of the 33rd ACM International Conference on Multimedia (MM '25), Dublin, Ireland: Association for Computing Machinery, 2025, pp. 10152–10160. doi:10.1145/3746027.3755441.
  39. [39] Y. Wang, H. Jiang, S. Yao, Z. Ding, and Z. Lu, "SENTINEL: A fully end-to-end language-action model for humanoid whole-body control", 2025. arXiv:2511.19236 [cs.RO].
  40. [40] L. Xiao et al., "MotionStreamer: Streaming motion generation via diffusion-based autoregressive model in causal latent space", in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025. arXiv:2503.15451 [cs.CV].
  41. [41] H. Jia et al., "ECHO: Edge-cloud humanoid orchestration for language-to-motion control", 2026. arXiv:2603.16188 [cs.CV].
  42. [42] H. Li, W. Chen, S. Liang, L. Wang, K. Yuan, and Y. Yue, "Z²-sampling: Zero-cost zigzag trajectories for semantic alignment in diffusion models", 2026. arXiv:2604.23536 [cs.CV]. Available: https://arxiv.org/abs/2604.23536.
  43. [43] H. Li, W. Chen, L. Wang, S. Liang, H. Jia, and Y. Yue, "Oracle noise: Faster semantic spherical alignment for interpretable latent optimization", 2026. arXiv:2604.23540 [cs.CV]. Available: https://arxiv.org/abs/2604.23540.
  44. [44] H. Li et al., "Delta score matters! Spatial adaptive multi-guidance in diffusion models", 2026. arXiv:2604.26503 [cs.CV]. Available: https://arxiv.org/abs/2604.26503.
  45. [45] P. Li et al., "FRoM-W1: Towards general humanoid whole-body control with language instructions", 2026. arXiv:2601.12799 [cs.RO].
  46. [46] W. Xie et al., "TextOp: Real-time interactive text-driven humanoid robot motion generation and control", 2026. arXiv:2602.07439 [cs.RO].
  47. [47] X. Yuan et al., "RoboForge: Physically optimized text-guided whole-body locomotion for humanoids", 2026. arXiv:2603.17927 [cs.RO].
  48. [48] W. Chen, H. Jia, K. Yu, S. Lai, L. Wang, and Y. Yue, "Towards better evaluation metrics for text-to-motion generation", in The Second International Workshop on Transformative Insights in Multifaceted Evaluation at The Web Conference, 2026.