Before the Body Moves: Learning Anticipatory Joint Intent for Language-Conditioned Humanoid Control
Pith reviewed 2026-05-15 02:11 UTC · model grok-4.3
The pith
DAJI learns dynamics-aligned joint intents so language commands produce humanoid actions that are both immediate and anticipatory of future contacts and balance shifts.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
DAJI is a hierarchical framework whose DAJI-Act component distills a future-aware teacher policy into a deployable diffusion action policy through student-driven rollouts, while its DAJI-Flow component autoregressively generates future intent chunks conditioned on language and prior intent. This joint-intent layer sits between language generation and closed-loop control and explicitly encodes upcoming contact changes, support transfers, and balance preparation. Experiments report 94.42% rollout success on HumanML3D-style generation tasks and 0.152 subsequence FID on BABEL streaming benchmarks.
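The division of labor in this claim can be made concrete with a toy streaming loop: DAJI-Flow proposes the next intent autoregressively from language and intent history, and DAJI-Act decodes each intent into an action. A minimal sketch, where `daji_flow_step` and `daji_act` are hypothetical stand-ins for the paper's models, not the actual implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
INTENT_DIM, CHUNK_LEN = 8, 4

def daji_flow_step(lang_embed, intent_history):
    # stand-in for DAJI-Flow: next intent from language + prior intent
    prev = intent_history[-1] if intent_history else np.zeros(INTENT_DIM)
    return np.tanh(0.5 * prev + 0.5 * lang_embed)

def daji_act(intent, proprio):
    # stand-in for the DAJI-Act diffusion action head
    return 0.1 * intent + 0.05 * proprio

lang = rng.normal(size=INTENT_DIM)
history, actions = [], []
proprio = np.zeros(INTENT_DIM)
for t in range(CHUNK_LEN * 3):  # stream three intent chunks
    intent = daji_flow_step(lang, history)
    history.append(intent)
    actions.append(daji_act(intent, proprio))
```

The point of the sketch is only the interface: intents are generated ahead of execution and conditioned on their own history, so the action head never sees raw language.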
What carries the argument
Dynamics-Aligned Joint Intent (DAJI), a hierarchical anticipatory interface that encodes future contact and balance information between language inputs and low-level control outputs.
If this is right
- Enables language-conditioned generation of whole-body actions that prepare for upcoming contacts rather than repairing after the fact.
- Supports streaming instruction following at 0.152 subsequence FID while maintaining 94.42% rollout success on standard benchmarks.
- Provides an explicit latent representation of future joint states that low-level trackers can use without reactive correction.
- Allows single-instruction and multi-turn language commands to produce coherent sequences of anticipatory intents.
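Both headline numbers are standard metrics; the subsequence FID in particular is the Fréchet distance between Gaussian fits of motion-feature statistics. A minimal numpy sketch of that distance, computed here on synthetic placeholder features rather than the paper's feature extractor:

```python
import numpy as np

def sqrtm_psd(a):
    # matrix square root of a symmetric PSD matrix via eigendecomposition
    w, v = np.linalg.eigh(a)
    return v @ np.diag(np.sqrt(np.clip(w, 0.0, None))) @ v.T

def frechet_distance(feats_a, feats_b):
    # fit a Gaussian to each feature set, then the closed-form Frechet distance
    mu1, mu2 = feats_a.mean(0), feats_b.mean(0)
    s1 = np.cov(feats_a, rowvar=False)
    s2 = np.cov(feats_b, rowvar=False)
    s1h = sqrtm_psd(s1)
    covmean = sqrtm_psd(s1h @ s2 @ s1h)  # same trace as sqrtm(s1 @ s2)
    diff = mu1 - mu2
    return float(diff @ diff + np.trace(s1 + s2 - 2.0 * covmean))

rng = np.random.default_rng(0)
feats = rng.normal(size=(256, 8))
```

Identical feature sets give a distance of zero; for the streaming benchmark the same formula is applied per subsequence window.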
Where Pith is reading between the lines
- The same distillation-plus-autoregressive structure could be tested on non-humanoid platforms where anticipatory contact planning matters, such as wheeled or legged mobile manipulators.
- If the joint-intent layer proves stable, it could reduce reliance on separate kinematic reference generators that later require low-level repair.
- Extending the intent chunks to include visual or tactile predictions would be a direct next measurement of whether the anticipatory signal generalizes beyond motion only.
Load-bearing premise
Distilling a future-aware teacher policy into a deployable diffusion action policy via student-driven rollouts preserves the anticipatory properties without substantial degradation in closed-loop performance.
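This premise amounts to a DAgger-style imitation loop: the student acts, the teacher labels the states the student actually visits, and the student is refit on the aggregated data. A toy linear sketch, where the teacher gain, dynamics, and least-squares student are hypothetical stand-ins for the paper's future-aware teacher and diffusion policy:

```python
import numpy as np

rng = np.random.default_rng(0)

def teacher_action(state):
    # stand-in for the future-aware teacher: a fixed stabilizing gain
    return -0.8 * state

def rollout(policy_w, steps=30):
    # toy dynamics as a stand-in for the simulator
    states, actions = [], []
    s = rng.normal(size=4)
    for _ in range(steps):
        a = policy_w @ s                   # student acts: rollouts are student-driven
        states.append(s.copy())
        actions.append(teacher_action(s))  # teacher labels the visited states
        s = 0.9 * s + 0.1 * a + 0.01 * rng.normal(size=4)
    return np.array(states), np.array(actions)

# distillation loop: roll out the student, aggregate, refit
W = np.zeros((4, 4))
X, Y = np.empty((0, 4)), np.empty((0, 4))
for it in range(5):
    s_batch, a_batch = rollout(W)
    X, Y = np.vstack([X, s_batch]), np.vstack([Y, a_batch])
    W = np.linalg.lstsq(X, Y, rcond=None)[0].T  # refit student on all data so far
```

In this toy the student recovers the teacher's gain exactly; the open question the premise papers over is whether the same loop preserves *anticipatory* structure when the student is a diffusion policy and the labels require foresight.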
What would settle it
A closed-loop evaluation that shows substantially lower rollout success or markedly higher subsequence FID when the distilled student policy is used instead of the teacher would indicate that the distillation step fails to retain anticipatory capability.
read the original abstract
Natural language is an intuitive interface for humanoid robots, yet streaming whole-body control requires control representations that are executable now and anticipatory of future physical transitions. Existing language-conditioned humanoid systems typically generate kinematic references that a low-level tracker must repair reactively, or use latent/action policies whose outputs do not explicitly encode upcoming contact changes, support transfers, and balance preparation. We propose DAJI (Dynamics-Aligned Joint Intent), a hierarchical framework that learns an anticipatory joint-intent interface between language generation and closed-loop control. DAJI-Act distills a future-aware teacher into a deployable diffusion action policy through student-driven rollouts, while DAJI-Flow autoregressively generates future intent chunks from language and intent history. Experiments show that DAJI achieves strong results in anticipatory latent learning, single-instruction generation, and streaming instruction following, reaching 94.42% rollout success on HumanML3D-style generation and 0.152 subsequence FID on BABEL.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes DAJI (Dynamics-Aligned Joint Intent), a hierarchical framework for language-conditioned humanoid control. DAJI-Act distills a future-aware teacher policy into a deployable diffusion action policy via student-driven rollouts, while DAJI-Flow autoregressively generates future intent chunks from language and intent history. Experiments claim strong performance in anticipatory latent learning, single-instruction generation, and streaming instruction following, with 94.42% rollout success on HumanML3D-style tasks and 0.152 subsequence FID on BABEL.
Significance. If the distillation successfully transfers anticipatory properties to the closed-loop diffusion policy, the work could advance language-conditioned whole-body control by enabling proactive handling of contact changes, support transfers, and balance preparation rather than relying on reactive low-level repairs. The separation of future-intent generation from executable action diffusion offers a scalable interface for streaming humanoid tasks.
major comments (1)
- [§3.2] DAJI-Act distillation procedure: the student-driven rollout objective contains no explicit future-horizon loss, teacher-student anticipation alignment term, or future-state prediction error regularizer. This is load-bearing for the central claim that the deployable diffusion policy encodes upcoming contact changes and balance preparation, because without such a term the policy can achieve high rollout success while collapsing to reactive behavior. Please add a quantitative alignment metric between teacher and student future predictions and report its value in the results.
minor comments (2)
- [Abstract] The quantitative claims (94.42% success, 0.152 FID) are stated without naming the baselines, ablation variants, or error-bar statistics; a one-sentence reference to the comparison methods would improve readability.
- [§2] Notation: the distinction between 'joint intent' and 'action' chunks is introduced without a compact equation or diagram in the early sections; adding a small schematic would clarify the hierarchical interface.
Simulated Author's Rebuttal
We thank the referee for their thoughtful review and constructive suggestion regarding the distillation procedure. We address the major comment below and outline the planned revision.
read point-by-point responses
Referee: [§3.2] DAJI-Act distillation procedure: the student-driven rollout objective contains no explicit future-horizon loss, teacher-student anticipation alignment term, or future-state prediction error regularizer. This is load-bearing for the central claim that the deployable diffusion policy encodes upcoming contact changes and balance preparation, because without such a term the policy can achieve high rollout success while collapsing to reactive behavior. Please add a quantitative alignment metric between teacher and student future predictions and report its value in the results.
Authors: We appreciate the referee's point that an explicit alignment term would more directly substantiate the transfer of anticipatory behavior. The current student-driven rollout objective matches the teacher policy's actions on trajectories that require proactive contact and balance decisions for success; this implicitly encourages the student to internalize future-aware behavior rather than purely reactive corrections. Nevertheless, we agree that a dedicated quantitative metric strengthens the central claim. In the revised manuscript we will introduce and report a teacher-student future-prediction alignment metric (mean L2 distance on predicted joint intents and contact flags over a 10-step horizon, computed on held-out rollouts) in the results section. revision: yes
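The metric the authors commit to (mean L2 distance on predicted joint intents plus contact-flag agreement over a 10-step horizon) is straightforward to compute. A hedged sketch, assuming teacher and student predictions are stacked as arrays; all shapes and names here are illustrative:

```python
import numpy as np

def intent_alignment(teacher_intents, student_intents,
                     teacher_contacts, student_contacts, horizon=10):
    """Teacher-student anticipation alignment over a fixed future horizon.

    intents:  (T, horizon, dof) predicted joint-intent trajectories
    contacts: (T, horizon, n_feet) boolean predicted contact flags
    Returns (mean L2 intent distance, contact-flag mismatch rate).
    """
    ti = teacher_intents[:, :horizon]
    si = student_intents[:, :horizon]
    l2 = np.linalg.norm(ti - si, axis=-1).mean()
    mismatch = (teacher_contacts[:, :horizon]
                != student_contacts[:, :horizon]).mean()
    return float(l2), float(mismatch)

rng = np.random.default_rng(0)
intents = rng.normal(size=(50, 10, 23))
contacts = rng.random(size=(50, 10, 2)) > 0.5
```

A student that perfectly matches the teacher scores (0.0, 0.0); the contact-mismatch rate is the component that would expose a policy that succeeds reactively while mispredicting upcoming contacts.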
Circularity Check
No significant circularity; claims rest on empirical rollouts
full rationale
The paper presents DAJI as a hierarchical framework with DAJI-Act performing distillation via student-driven rollouts from a future-aware teacher and DAJI-Flow generating intent chunks autoregressively. Reported metrics (94.42% rollout success, 0.152 subsequence FID) are obtained from closed-loop evaluations on HumanML3D and BABEL benchmarks. No equations, definitions, or self-citations in the abstract or described method reduce these outcomes to fitted parameters or prior results by construction. The distillation step is presented as an empirical procedure without a uniqueness theorem or ansatz imported from self-citation that would force the anticipatory properties. This is the common case of an empirical robotics paper whose central claims remain falsifiable via external rollouts.
Axiom & Free-Parameter Ledger
free parameters (1)
- diffusion and flow model hyperparameters
axioms (1)
- domain assumption: a future-aware teacher policy can be distilled into a real-time student without loss of anticipatory capability