Emotion-Conditioned Short-Horizon Human Pose Forecasting with a Lightweight Predictive World Model
Pith reviewed 2026-05-08 06:42 UTC · model grok-4.3
The pith
Emotion embeddings derived from facial expressions can serve as useful auxiliary signals for improving short-horizon human pose prediction in a lightweight model.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Incorporating facial expression-derived emotion embeddings into emotion-conditional short-term pose prediction based on a lightweight predictive world model architecture is a feasible approach. Experiments on two small-scale datasets show that normalized gating fusion significantly improves accuracy on natural emotion-driven sequences, while simple multimodal fusion does not, and counterfactual perturbations confirm that predicted trajectories respond measurably to changes in the emotion input rather than treating it as redundant.
What carries the argument
Lightweight autoregressive predictive world model using a two-layer LSTM that fuses pose keypoints with emotion embeddings via a learnable normalized gating mechanism for 15-step rolling forecasts.
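The gated fusion and rolling forecast described above can be sketched as follows. The paper does not give its exact formulation, so the gate parameterization, feature dimensions, and the one-step transition `step_fn` below are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def normalized_gated_fusion(pose_feat, emo_feat, W_g, b_g):
    """One plausible reading of 'learnable normalized gating': a sigmoid
    gate over a linear map of the concatenated features, renormalized to
    sum to one, weighs the emotion channel before it is added to the pose
    features. Assumes pose_feat and emo_feat share dimension d."""
    z = np.concatenate([pose_feat, emo_feat]) @ W_g + b_g
    gate = 1.0 / (1.0 + np.exp(-z))       # sigmoid gate, one value per dim
    gate = gate / (gate.sum() + 1e-8)     # normalization step
    return pose_feat + gate * emo_feat

def rolling_forecast(pose, emo, step_fn, horizon=15):
    """Autoregressive unfolding: each predicted pose is fed back as input
    for the next step, conditioned on a fixed emotion embedding."""
    preds = []
    for _ in range(horizon):
        pose = step_fn(pose, emo)         # e.g. an LSTM step over fused features
        preds.append(pose)
    return np.stack(preds)                # shape: (horizon, pose_dim)
```

Here `step_fn` stands in for the paper's two-layer LSTM transition; a faithful version would also carry the recurrent hidden and cell states across steps.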
Load-bearing premise
Facial expression-derived emotion embeddings supply meaningful auxiliary signals that causally influence pose dynamics beyond geometric motion cues, and the two small-scale datasets suffice to demonstrate consistent benefits from the gating mechanism.
What would settle it
Replacing the emotion embeddings with random noise or zero vectors on the natural emotion-driven sequences and finding no increase in prediction error would falsify the claim that they function as useful conditional signals.
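That falsification test can be written as a small ablation harness. The `model_fn(pose0, emo, horizon)` interface and all names here are hypothetical, a sketch of the check rather than the paper's code:

```python
import numpy as np

def emotion_ablation_gap(model_fn, poses_gt, pose0, emo, rng):
    """Compare rollout error using the true emotion embedding against zero
    and random-noise substitutes. A gap near zero would falsify the claim
    that the embedding works as a useful conditional signal."""
    horizon = len(poses_gt)

    def err(e):
        pred = model_fn(pose0, e, horizon)
        return float(np.mean(np.linalg.norm(pred - poses_gt, axis=-1)))

    e_true = err(emo)
    e_zero = err(np.zeros_like(emo))
    e_noise = err(rng.standard_normal(emo.shape))
    # Positive gap: corrupting the emotion input increases prediction error.
    return {"true": e_true, "zero": e_zero, "noise": e_noise,
            "gap": min(e_zero, e_noise) - e_true}
```

On a toy predictor whose dynamics genuinely depend on the emotion input, the gap is positive; a model that ignores the emotion channel would return a gap of exactly zero.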
Original abstract
Short-term human pose prediction plays a crucial role in interactive systems, assistive robots, and emotion-aware human-computer interaction[1-3]. While current trajectory prediction models primarily rely on geometric motion cues, they often overlook the underlying emotional signals influencing human motion dynamics[4-5]. This paper investigates whether facial expression-derived emotion embeddings can provide auxiliary conditional signals for short-term pose prediction. To further evaluate multimodal conditionation in a recursive prediction setting, we propose a lightweight autoregressive predictive world model that performs 15-step rolling pose prediction. This framework combines pose keypoints with emotion embeddings through a learnable gating mechanism and performs autoregressive unfolding prediction using a recurrent sequence model based on a two-layer LSTM architecture. Experiments were conducted on two small-scale pose-emotion video datasets: controlled motion sequences with minimal facial expression changes and natural emotion-driven motion sequences with considerable facial expression changes. The results show that simple multimodal fusion does not consistently improve prediction accuracy, while normalized gating fusion significantly enhances the performance of emotion-driven motion sequences. Furthermore, counterfactual perturbation experiments demonstrate that the predicted trajectory exhibits measurable sensitivity to changes in multimodal input, suggesting that facial expression embeddings act as auxiliary conditional signals rather than redundant features. In summary, these results indicate that incorporating facial expression-derived emotion embeddings into emotion-conditional short-term pose prediction based on a lightweight predictive world model architecture is a feasible approach.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that facial expression-derived emotion embeddings can serve as useful auxiliary conditional signals for short-horizon (15-step) human pose forecasting. It introduces a lightweight autoregressive predictive world model based on a two-layer LSTM that fuses pose keypoints with emotion embeddings via a learnable normalized gating mechanism, shows that this outperforms simple multimodal fusion on natural emotion-driven sequences from one of two small-scale datasets, and uses counterfactual emotion perturbations to demonstrate measurable sensitivity in the predicted trajectories, concluding that the approach is feasible for emotion-conditioned pose prediction.
Significance. If the quantitative results hold, the work provides a practical demonstration that emotion signals can be incorporated into lightweight recurrent world models for short-term pose forecasting without requiring heavy architectures. The normalized gating mechanism and counterfactual sensitivity tests offer a concrete way to isolate the contribution of the auxiliary modality. However, the absence of quantitative results, baselines, dataset statistics, or error analysis in the reported experiments limits the ability to judge the practical magnitude or robustness of the claimed benefit.
major comments (2)
- [Abstract] The central claim that 'normalized gating fusion significantly enhances the performance of emotion-driven motion sequences' is presented without any quantitative metrics (e.g., MPJPE, ADE/FDE), error bars, baseline comparisons (vanilla LSTM, other fusion methods), dataset sizes, train/test splits, or statistical significance tests. This information is load-bearing for evaluating whether the observed improvement supports the feasibility conclusion.
- [Experiments] In the experiments (as described in the abstract), model parameters, including the learnable gating weights, are fitted on the same small-scale datasets used for evaluation. While counterfactual perturbations provide an independent sensitivity check, the performance comparison remains tied to in-distribution behavior on limited data, weakening the generalization aspect of the feasibility claim.
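For concreteness, the metrics the report asks for have standard definitions and can be computed as below; the array shapes (frames T, joints J, coordinates D) are assumptions about how the pose data would be laid out:

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean Per-Joint Position Error over (T, J, D) pose arrays:
    Euclidean distance per joint, averaged over joints and frames."""
    return float(np.mean(np.linalg.norm(pred - gt, axis=-1)))

def ade_fde(pred, gt):
    """Average and Final Displacement Error over (T, D) trajectories:
    mean displacement across the horizon, and displacement at the
    final predicted step."""
    d = np.linalg.norm(pred - gt, axis=-1)
    return float(d.mean()), float(d[-1])
```

Reporting these per dataset, alongside a vanilla-LSTM baseline and the simple-fusion variant, would make the claimed gating improvement checkable.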
minor comments (2)
- [Abstract] The term 'multimodal conditionation' appears to be a typo for 'multimodal conditioning'; correct for clarity.
- [Introduction] The abstract cites [1-5] but does not indicate whether the full manuscript provides a detailed related-work discussion situating the gating mechanism against prior multimodal fusion or world-model approaches to pose forecasting.
Simulated Author's Rebuttal
We are grateful to the referee for the thorough review and valuable suggestions. Below, we respond to each major comment and indicate the changes we plan to implement in the revised manuscript.
Point-by-point responses
- Referee: [Abstract] The central claim that 'normalized gating fusion significantly enhances the performance of emotion-driven motion sequences' is presented without any quantitative metrics (e.g., MPJPE, ADE/FDE), error bars, baseline comparisons (vanilla LSTM, other fusion methods), dataset sizes, train/test splits, or statistical significance tests. This information is load-bearing for evaluating whether the observed improvement supports the feasibility conclusion.
Authors: We fully agree that the abstract would benefit from including quantitative evidence to support the central claim. Currently, the abstract provides a qualitative summary of the results. In the revision, we will update the abstract to incorporate specific performance metrics, including MPJPE and ADE/FDE values for the proposed normalized gating fusion versus baselines such as the vanilla LSTM and simple multimodal fusion. We will also include information on dataset sizes, train/test splits, and indicate that the improvements were validated with statistical significance tests. This revision will make the feasibility conclusion more robust and easier to evaluate. revision: yes
- Referee: [Experiments] In the experiments (as described in the abstract), model parameters, including the learnable gating weights, are fitted on the same small-scale datasets used for evaluation. While counterfactual perturbations provide an independent sensitivity check, the performance comparison remains tied to in-distribution behavior on limited data, weakening the generalization aspect of the feasibility claim.
Authors: We appreciate the point regarding the in-distribution evaluation on small-scale datasets. The model parameters are indeed optimized on the evaluation datasets, which is standard but does limit strong generalization claims. The counterfactual perturbation analysis serves as an additional check for the role of emotion embeddings. In the revised version, we will add explicit discussion in the Experiments and Conclusions sections about the dataset limitations, the in-distribution nature of the results, and the scope of the feasibility demonstration. We will also include more detailed dataset statistics and train/test split information to provide better context. revision: yes
Circularity Check
No significant circularity detected
Full rationale
The paper presents an empirical feasibility study of an LSTM-based autoregressive world model with gating for emotion-conditioned pose forecasting. The central claim rests on experimental results from training and evaluating on two small datasets, with ablation on fusion methods and counterfactual perturbations. No mathematical derivation chain, first-principles prediction, or load-bearing self-citation is present in the provided text; performance metrics are reported as post-training observations rather than being equivalent to inputs by construction. Standard supervised training on the evaluation distribution is not treated as circular under the guidelines, as the model architecture and gating mechanism introduce independent structure that is tested against baselines.
Axiom & Free-Parameter Ledger
free parameters (1)
- learnable gating weights
axioms (1)
- Standard LSTM cells can capture temporal dependencies in human pose sequences sufficiently well for short-horizon autoregressive prediction.
Reference graph
Works this paper leans on
- [1] Introduction: Human motion prediction is a fundamental problem in computer vision with applications in human-robot interaction, behavior understanding, and interactive virtual environments[6-7]. Most existing short-horizon pose forecasting methods rely primarily on geometric motion representations extracted from body keypoints, while affective signals deri...
- [2] Related Work: This work is inspired by the predictive world model[9]. In this model, intelligent systems learn compact latent representations to capture the dynamic changes of the physical world, rather than directly predicting pixel-level observations. Instead of simply optimizing short-term geometric accuracy, predictive world models aim to learn interna...
- [3] FusionPredictor (Model Architecture and Method): 3.1.1 Gated Multimodal Fusion Framework. Figure 2 (Gated Multi-Modal Fusion Framework) illustrates the end-to-end implementation. Data Preprocessing: raw pose and emotion files are cleaned and normalized before being fed into specialized data loaders. Model Architecture: the core "FusionPredi...
- [4] Dataset Construction: Because publicly available synchronized pose–emotion forecasting datasets are limited, we construct a pilot-scale multimodal evaluation dataset using publicly available video sequences. The dataset contains two subsets. Dataset I: Controlled Motion Sequence consists of 420 samples derived from Intel RealSense demonstration sequences[1...
- [5] Experiments: We evaluate the proposed framework on two complementary video sources designed to test multimodal motion prediction under both controlled and in-the-wild conditions. The first dataset I consists of 420 samples of sequences from the Intel® OpenVINO™ Toolkit Sample Video Suite. These sequences provide stable motion trajectories and controlled re...
- [6] Conclusion and Future Work: Counterfactual perturbation experiments confirm the measurable sensitivity of predicted trajectories to mood changes, supporting the explanation that facial expression-derived embeddings act as auxiliary prediction conditional signals rather than the primary motion driver. Simple multimodal fusion does not consistently improve p...
- [7]
- [8] Zhiliang, L., Zhuo, L. Deep learning-based approaches for human pose estimation in interdisciplinary physics applications. Sci Rep 15, 42883 (2025). https://doi.org/10.1038/s41598-025-26972-4
- [9]
- [10] Chongyang Zhong, Lei Hu, Shihong Xia. Spatial-temporal modeling for prediction of stylized human motion. 2022. https://www.sciencedirect.com/science/article/pii/S092523122201075
- [11] Chen, Z. Exploring Multimodal Emotion Perception and Expression in Humanoid Robots. Applied and Computational Engineering, 174, 85-90 (2025). https://ace.ewapub.com/article/view/24880
- [12] Yucheng Huang, Hong Yan. Human Trajectory Prediction Based on a Single Frame of Pose and Initial Velocity Information. 2025. https://www.mdpi.com/2079-9292/14/13/2636
- [13]
- [14] A-Seong Moon, Haesung Kim, Ye-Chan Park, Jaesung Lee. A Survey on Multimodal Emotion Recognition: Methods, Datasets, and Future Directions. 2026. https://www.techscience.com/cmc/v87n2/66647/html
- [15]
- [16]
- [17]
- [18] Lucas Maes, Quentin Le Lidec, Damien Scieur, Yann LeCun, Randall Balestriero. LeWorldModel: Stable End-to-End JEPA from Pixels. 2026. https://le-wm.github.io
- [19] Intel. OpenVINO Toolkit. 2024. https://docs.openvino.ai/2024/notebooks/pose-estimation-with-output.html
- [20] Camillo Lugaresi, Jiuqiang Tang, Hadon Nash, Chris McClanahan, Esha Uboweja, Michael Hays, Fan Zhang, Chuo-Ling Chang, Ming Guang Yong, Juhyun Lee, Wan-Teh Chang, Wei Hua, Manfred Georg, Matthias Grundmann. MediaPipe: A Framework for Building Perception Pipelines. 2019. https://doi.org/10.48550/arXiv.1906.08172
- [21] Yury Kartynnik, Artsiom Ablavatski, Ivan Grishchenko, Matthias Grundmann. Real-time Facial Surface Geometry from Monocular Video on Mobile GPUs. 2019. https://doi.org/10.48550/arXiv.1907.06724
- [22] Valentin Bazarevsky, Ivan Grishchenko, Karthik Raveendran, Tyler Zhu, Fan Zhang, Matthias Grundmann. BlazePose: On-device Real-time Body Pose Tracking. 2020. https://doi.org/10.48550/arXiv.2006.10204