Recognition: unknown
Introducing WARM-VR: Benchmark Dataset for Multimodal Wearable Affect Recognition in Virtual Reality
Pith reviewed 2026-05-09 20:44 UTC · model grok-4.3
The pith
WARM-VR supplies a public multimodal dataset of wearable signals collected during stress induction and multisensory relaxation in virtual reality.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors establish WARM-VR as a publicly available dataset of multimodal wearable recordings from 31 participants aged 19-37, captured during VR sessions that first induce stress through an arithmetic task and then promote relaxation in a calming beach environment with added olfactory stimuli, validated by significant reductions in negative affect on questionnaires and supported by initial machine learning benchmarks on the physiological signals.
What carries the argument
The WARM-VR dataset of synchronized BVP, EDA, skin temperature, three-axis acceleration, and ECG signals paired with self-report questionnaires, collected under a VR protocol that sequences arithmetic stress induction with multisensory beach relaxation.
If this is right
- Machine learning models trained on the data can classify valence from BVP signals at F1 0.63 and AUC 0.69 using CNN or CNN-Bi-GRU architectures.
- A lightweight transformer yields balanced results for arousal classification with F1 scores of 0.54 and 0.63.
- A CNN-Bi-GRU model achieves the highest performance in the relaxation task at average F1 0.64 and AUC 0.69.
- Olfactory enhancement during VR relaxation produces stronger reductions in negative affect than visual-auditory stimuli alone.
Where Pith is reading between the lines
- The dataset could enable real-time adaptive VR systems that adjust content based on detected user affect during therapy or training sessions.
- Similar multisensory protocols might be tested in other immersive settings to expand affect recognition beyond current VR setups.
- Public release of the raw signals and labels allows independent verification and extension to additional sensor combinations or participant groups.
Load-bearing premise
The specific sequence of arithmetic stress induction followed by beach relaxation with visual, auditory, and olfactory stimuli produces reliably distinct and measurable affective states as shown by both questionnaires and physiological changes.
What would settle it
If a replication with the same protocol finds no statistically significant drop in negative affect scores from stress to relaxation phases or if physiological signals show no consistent correlation with the self-reports, the dataset would not support its intended use for affect recognition.
Figures
read the original abstract
With the growing integration of human-computer interaction into everyday life, advances in machine learning have enabled systems to better perceive and respond to users' emotional states. Most existing affect recognition datasets focus on static environments, limiting their applicability to immersive multimedia contexts such as Virtual Reality (VR). In this paper, we introduce WARM-VR, a novel publicly available multimodal dataset designed to support affect recognition in immersive, multisensory environments using wearable sensing instrumentation. Data were collected from 31 participants aged 19-37 using wearable sensors: a wristband measuring Blood Volume Pulse (BVP), EDA, skin Temperature, three-axis Acceleration, and a chest strap recording ECG signals. Participants engaged in immersive VR experiences designed to elicit relaxation through a calming beach environment following stress induction via an arithmetic task. These sessions incorporated synchronized multimedia stimuli: visual, auditory, and olfactory. Affective states were assessed subjectively through validated self-report questionnaires and objectively through the analysis of physiological measurements. Statistical analysis of the questionnaires confirmed that VR relaxation significantly reduced negative affect, particularly with olfactory enhancement. Furthermore, we established a benchmark on the dataset using widely recognized machine learning algorithms. The best performance for binary classification from BVP data of valence, was obtained with a CNN and a CNN-Bi-GRU model, both achieving an average F1-score of 0.63 and an AUC of 0.69. For arousal, a lightweight Transformer architecture provided the most balanced results (F1-0 0.54 and F1-1 0.63), outperforming recurrent hybrids. In the relaxation task, a CNN-Bi-GRU model reached the highest overall performance (average F1-score 0.64, AUC 0.69).
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces WARM-VR, a publicly available multimodal dataset collected from 31 participants (aged 19-37) using wearable sensors (wristband for BVP, EDA, skin temperature, acceleration; chest strap for ECG) during VR sessions that induce stress via an arithmetic task followed by relaxation in a multisensory beach environment (visual, auditory, and olfactory stimuli). Affective states are validated through self-report questionnaires showing significant reduction in negative affect (especially with olfactory enhancement), and the paper provides baseline ML benchmarks for binary valence and arousal classification from the physiological signals, reporting best results of average F1=0.63 and AUC=0.69 using CNN and CNN-Bi-GRU models on BVP data for valence.
Significance. If the induced affective states are reliably captured in the wearable signals and the dataset is properly documented, WARM-VR would fill a gap by providing the first public multimodal wearable dataset for affect recognition in immersive, multisensory VR, enabling future work on VR-specific HCI and affective computing. The public release, inclusion of olfactory stimuli, and combination of subjective and objective measures are positive contributions; however, the modest benchmark performance limits immediate utility claims.
major comments (3)
- Abstract: The reported best performance for binary valence classification from BVP data (F1=0.63, AUC=0.69 with CNN and CNN-Bi-GRU) is only marginally above chance; this undermines the central claim that the dataset supports affect recognition unless the paper demonstrates (via subject-independent cross-validation details, label binarization thresholds, and comparison to random baselines) that the signals contain detectable patterns rather than noise or motion artifacts from VR use.
- Abstract: The claim that 'statistical analysis of the questionnaires confirmed that VR relaxation significantly reduced negative affect' lacks reported test statistics, p-values, effect sizes, or correction for multiple comparisons, making it impossible to evaluate whether the self-report labels are strong enough to serve as ground truth for the ML benchmarks.
- Abstract (and implied methods): No details are provided on data preprocessing (e.g., artifact removal for BVP/ECG in VR), participant exclusion criteria, exact demographics beyond age range, or the cross-validation scheme (subject-dependent vs. independent); these are load-bearing for reproducibility and for interpreting why performance remains low.
minor comments (2)
- Abstract: The phrasing 'F1-0 0.54 and F1-1 0.63' is unclear and appears to be missing formatting or an operator; revise to explicitly state F1-score for each class.
- Abstract: The final sentence on the relaxation task benchmark ('average F1-score 0.64, AUC 0.69') does not specify which signals or models were used, reducing clarity.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback on our manuscript. We agree that additional details are needed to strengthen the claims regarding the dataset's utility for affect recognition and to ensure full reproducibility. We address each major comment below and will incorporate the requested clarifications and analyses in the revised version.
read point-by-point responses
-
Referee: Abstract: The reported best performance for binary valence classification from BVP data (F1=0.63, AUC=0.69 with CNN and CNN-Bi-GRU) is only marginally above chance; this undermines the central claim that the dataset supports affect recognition unless the paper demonstrates (via subject-independent cross-validation details, label binarization thresholds, and comparison to random baselines) that the signals contain detectable patterns rather than noise or motion artifacts from VR use.
Authors: We acknowledge that the reported benchmark performance is modest and only marginally above chance. In the revision, we will add explicit comparisons against random baselines (e.g., shuffled labels and majority-class predictors) to quantify the improvement. We will also detail the subject-independent cross-validation scheme (leave-one-subject-out), the exact binarization thresholds applied to the self-report valence and arousal scales (median split on the 1-9 SAM ratings), and additional preprocessing steps that mitigate VR motion artifacts in BVP/ECG. These additions will demonstrate that detectable affective patterns exist in the signals beyond noise. revision: yes
-
Referee: Abstract: The claim that 'statistical analysis of the questionnaires confirmed that VR relaxation significantly reduced negative affect' lacks reported test statistics, p-values, effect sizes, or correction for multiple comparisons, making it impossible to evaluate whether the self-report labels are strong enough to serve as ground truth for the ML benchmarks.
Authors: We agree that the abstract omits the supporting statistics. The full manuscript contains paired t-tests (or Wilcoxon signed-rank tests where normality assumptions were violated) showing significant reductions in negative affect (PANAS and SAM scales), with reported p-values, Cohen's d effect sizes, and Bonferroni correction for multiple comparisons across affect dimensions. We will move these statistics into the abstract and expand the methods section to confirm the self-report labels provide reliable ground truth for the benchmarks. revision: yes
-
Referee: Abstract (and implied methods): No details are provided on data preprocessing (e.g., artifact removal for BVP/ECG in VR), participant exclusion criteria, exact demographics beyond age range, or the cross-validation scheme (subject-dependent vs. independent); these are load-bearing for reproducibility and for interpreting why performance remains low.
Authors: We will substantially expand the methods section with the missing details: (1) preprocessing pipeline including bandpass filtering, peak detection, and artifact rejection for BVP/ECG (e.g., using signal quality indices to exclude segments with excessive VR head-motion artifacts); (2) participant exclusion criteria (e.g., incomplete sessions or poor signal quality leading to removal of X participants); (3) full demographics (mean age, gender distribution, handedness); and (4) confirmation that all ML benchmarks use subject-independent cross-validation. These revisions will directly address reproducibility and aid interpretation of the modest performance. revision: yes
Circularity Check
No significant circularity in empirical dataset and benchmarking paper
full rationale
The paper introduces a new multimodal dataset collected from 31 participants via wearable sensors during VR stress-relaxation protocols, validates affective state changes through self-report questionnaires and standard statistical tests, and reports benchmark results from off-the-shelf ML models (CNN, CNN-Bi-GRU, Transformer) on BVP/ECG/EDA signals. No derivation chain, first-principles predictions, fitted parameters renamed as outputs, or load-bearing self-citations exist; all claims are grounded in experimental data collection and public dataset release, which remain independently verifiable.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption Validated self-report questionnaires accurately capture changes in affective states induced by VR stimuli
- domain assumption Wearable physiological signals (BVP, EDA, ECG) contain information usable for binary valence and arousal classification in immersive settings
Reference graph
Works this paper leans on
-
[1]
Empatica E4 Wristband for Research
2025. Empatica E4 Wristband for Research. https://www.empatica.com/research/e4/. Accessed: 2025-02-25
2025
-
[2]
Meta Quest 2 - Virtual Reality Headset
2025. Meta Quest 2 - Virtual Reality Headset. https://www.meta.com/ca/quest/. Accessed: 2025-01-19
2025
-
[3]
Polar H10 Heart Rate Sensor
2025. Polar H10 Heart Rate Sensor. https://www.polar.com/ca-en/sensors/h10-heart-rate-sensor. Accessed: 2025-02-26. Manuscript submitted to ACM Introducing WARM-VR: A Benchmark Dataset for Multimodal Wearable Affect Recognition in Virtual Reality 15
2025
-
[4]
Zeeshan Ahmad and Naimul Khan. 2022. A Survey on Physiological Signal-Based Emotion Recognition.Bioengineering9 (11 2022), 688. Issue 11. doi:10.3390/bioengineering9110688
-
[5]
Karim Alghoul, Hussein Al Osman, and Abdulmotaleb El Saddik. 2025. Enhancing Generalization in PPG-Based Emotion Measurement with a CNN-TCN-LSTM Model. In2025 IEEE International Instrumentation and Measurement Technology Conference (I2MTC). 1–6. doi:10.1109/I2MTC62753. 2025.11079085
-
[6]
Amanda A. Benbow and Page L. Anderson. 2019. A meta-analytic examination of attrition in virtual reality exposure therapy for anxiety disorders. Journal of Anxiety Disorders61 (1 2019), 18–26. doi:10.1016/j.janxdis.2018.06.006
-
[7]
Brandon Birckhead, Carine Khalil, Xiaoyu Liu, Samuel Conovitz, Albert Rizzo, Itai Danovitch, Kim Bullock, and Brennan Spiegel. 2019. Recommen- dations for Methodology of Virtual Reality Clinical Trials in Health Care by an International Working Group: Iterative Study.JMIR Mental Health6 (1 2019), e11973. Issue 1. doi:10.2196/11973
-
[8]
Measuring emotion: the self-assessment manikin and the semantic differential
Margaret M. Bradley and Peter J. Lang. 1994. Measuring emotion: The self-assessment manikin and the semantic differential.Journal of Behavior Therapy and Experimental Psychiatry25 (3 1994), 49–59. Issue 1. doi:10.1016/0005-7916(94)90063-9
-
[9]
Dayne R Camara and Richard E Hicks. [n. d.]. USING VIRTUAL REALITY TO REDUCE STATE ANXIETY AND STRESS IN UNIVERSITY STUDENTS: AN EXPERIMENT. ([n. d.]). doi:10.5176/2345-7929_4.2.100
-
[10]
Min Chen, Wenjing Xiao, Miao Li, Yixue Hao, Long Hu, and Guangming Tao. 2022. A Multi-feature and Time-aware-based Stress Evaluation Mechanism for Mental Status Adjustment.ACM Transactions on Multimedia Computing, Communications, and Applications18 (2 2022), 1–18. Issue 1s. doi:10.1145/3462763
-
[11]
Yaşar Daşdemir. 2022. Cognitive investigation on the effect of augmented reality-based reading on emotion classification performance: A new dataset.Biomedical Signal Processing and Control78 (9 2022). doi:10.1016/j.bspc.2022.103942
-
[12]
Yaşar Daşdemir. 2023. Classification of Emotional and Immersive Outcomes in the Context of Virtual Reality Scene Interactions.Diagnostics13 (11 2023). Issue 22. doi:10.3390/diagnostics13223437
-
[13]
Sébastien Grenier, Hélène Forget, Stéphane Bouchard, Sébastien Isere, Sylvie Belleville, Olivier Potvin, Marie Ève Rioux, and Mélissa Talbot. 2015. Using virtual reality to improve the efficacy of cognitive-behavioral therapy (CBT) in the treatment of late-life anxiety: preliminary recommendations for future research.International Psychogeriatrics27 (7 20...
-
[14]
Chongomweru Halimu, Asem Kasem, and S. H. Shah Newaz. 2019. Empirical Comparison of Area under ROC curve (AUC) and Mathew Correlation Coefficient (MCC) for Evaluating Machine Learning Algorithms on Imbalanced Datasets for Binary Classification. InProceedings of the 3rd International Conference on Machine Learning and Soft Computing(New York, NY, USA). ACM...
-
[15]
Miho Igarashi, Harumi Ikei, Chorong Song, and Yoshifumi Miyazaki. 2014. Effects of olfactory stimulation with rose and orange oil on prefrontal cortex activity.Complementary Therapies in Medicine22 (12 2014), 1027–1031. Issue 6. doi:10.1016/j.ctim.2014.09.003
-
[16]
Ye Ji Jin, Erkinov Habibilloh, Ye Seul Jang, Taejun An, Donghyun Jo, Saron Park, and Won Du Chang. 2022. A Photoplethysmogram Dataset for Emotional Analysis.Applied Sciences (Switzerland)12 (7 2022). Issue 13. doi:10.3390/app12136544
-
[17]
Sander Koelstra, Christian Mühl, Mohammad Soleymani, Jong Seok Lee, Ashkan Yazdani, Touradj Ebrahimi, Thierry Pun, Anton Nijholt, and Ioannis Patras. 2012. DEAP: A database for emotion analysis; Using physiological signals.IEEE Transactions on Affective Computing3 (1 2012), 18–31. Issue
2012
-
[18]
doi:10.1109/T-AFFC.2011.15
-
[19]
Krzysztof Kutt, Pawel Wegrzyn, Szymon Bobek, Grzegorz J Nalepa, Jan Argasi«ski, PaweªW¦grzyn, and Mateusz. [n. d.].Affective Computing Experiments in Virtual Reality with Wearable Sensors. Methodological considerations and preliminary results Aective Computing Experiments in Virtual Reality with Wearable Sensors. Methodological considerations and prelimin...
-
[20]
Javier Traver, Tuukka Ruotsalo, Aleksandra Kawala-Sterniuk, and Luis A Leiva
Kayhan Latifzadeh, Nima Gozalpour, V. Javier Traver, Tuukka Ruotsalo, Aleksandra Kawala-Sterniuk, and Luis A Leiva. 2024. Efficient Decoding of Affective States from Video-elicited EEG Signals: An Empirical Investigation.ACM Transactions on Multimedia Computing, Communications, and Applications20 (10 2024), 1–24. Issue 10. doi:10.1145/3663669
-
[21]
Min Seop Lee, Yun Kyu Lee, Myo Taeg Lim, and Tae Koo Kang. 2020. Emotion recognition using convolutional neural network with selected statistical photoplethysmogram features.Applied Sciences (Switzerland)10 (5 2020). Issue 10. doi:10.3390/app10103501
-
[22]
Min Seop Lee, Yun Kyu Lee, Dong Sung Pae, Myo Taeg Lim, Dong Won Kim, and Tae Koo Kang. 2019. Fast emotion recognition based on single pulse PPG signal with convolutional neural network.Applied Sciences (Switzerland)9 (8 2019). Issue 16. doi:10.3390/app9163355
-
[23]
Dahua Li, Zhiyi Yang, Fazheng Hou, Qiaoju Kang, Shuang Liu, Yu Song, Qiang Gao, and Enzeng Dong. 2022. EEG-Based Emotion Recognition With Haptic Vibration by a Feature Fusion Method.IEEE Transactions on Instrumentation and Measurement71 (2022), 1–11. doi:10.1109/TIM.2022.3147882
-
[24]
Charles X. Ling, Jin Huang, and Harry Zhang. 2003.AUC: A Better Measure than Accuracy in Comparing Learning Algorithms. 329–341. doi:10.1007/3- 540-44886-1_25
work page doi:10.1007/3- 2003
-
[25]
Zhihan Lv, Fabio Poiesi, Qi Dong, Jaime Lloret, and Houbing Song. 2024. Special Issue on Deep Learning for Intelligent Human Computer Interaction. ACM Transactions on Multimedia Computing, Communications, and Applications20 (2 2024), 1–5. Issue 2. doi:10.1145/3605151
-
[26]
Nir Milstein and Ilanit Gordon. 2020. Validating Measures of Electrodermal Activity and Heart Rate Variability Derived From the Empatica E4 Utilized in Research Settings That Involve Interactive Dyadic States.Frontiers in Behavioral Neuroscience14 (8 2020). doi:10.3389/fnbeh.2020.00148
-
[27]
Hussein Al Osman, Haiwei Dong, and Abdulmotaleb El Saddik. 2016. Ubiquitous Biofeedback Serious Game for Stress Management.IEEE Access4 (2016), 1274–1286. doi:10.1109/ACCESS.2016.2548980 Manuscript submitted to ACM 16 K.Alghoul et al
-
[28]
Aasim Raheel, Muhammad Majid, and Syed Muhammad Anwar. 2021. DEAR-MULSEMEDIA: Dataset for emotion analysis and recognition in response to multiple sensorial media.Information Fusion65 (1 2021), 37–49. doi:10.1016/j.inffus.2020.08.007
-
[29]
Ammar Rashed, Shervin Shirmohammadi, Ihab Amer, and Mohamed Hefeeda. 2025. A Review of Player Engagement Estimation in Video Games: Challenges and Opportunities.ACM Transactions on Multimedia Computing, Communications, and Applications21 (7 2025), 1–33. Issue 7. doi:10.1145/3722116
-
[30]
Nafiul Rashid, Luke Chen, Manik Dautta, Abel Jimenez, Peter Tseng, and Mohammad Abdullah Al Faruque. 2021. Feature Augmented Hybrid CNN for Stress Recognition Using Wrist-based Photoplethysmography Sensor. InProceedings of the Annual International Conference of the IEEE Engineering in Medicine and Biology Society, EMBS. Institute of Electrical and Electro...
-
[31]
Jacob Rodrigues, Octavian Postolache, and Francisco Cercas
Mariana C. Jacob Rodrigues, Octavian Postolache, and Francisco Cercas. 2023. The Influence of Stress Noise and Music Stimulation on the Autonomous Nervous System.IEEE Transactions on Instrumentation and Measurement72 (2023), 1–19. doi:10.1109/TIM.2023.3279881
-
[32]
MASAHITO SAKAKIBARA, SATOSHI TAKEUCHI, and JUNICHIRO HAYANO. 1994. Effect of relaxation training on cardiac parasympathetic tone. Psychophysiology31 (5 1994), 223–228. Issue 3. doi:10.1111/j.1469-8986.1994.tb02210.x
-
[33]
Seyed Salehizadeh, Duy Dao, Jeffrey Bolkhovsky, Chae Cho, Yitzhak Mendelson, and Ki Chon. 2015. A Novel Time-Varying Spectral Filtering Algo- rithm for Reconstruction of Motion Artifact Corrupted Heart Rate Signals During Intense Physical Activities Using a Wearable Photoplethysmogram Sensor.Sensors16 (12 2015), 10. Issue 1. doi:10.3390/s16010010
-
[34]
Philip Schmidt, Attila Reiss, Robert Duerichen, and Kristof Van Laerhoven. 2018. Introducing WeSAD, a multimodal dataset for wearable stress and affect detection. InICMI 2018 - Proceedings of the 2018 International Conference on Multimodal Interaction. Association for Computing Machinery, Inc, 400–408. doi:10.1145/3242969.3242985
-
[35]
Monireh (Monica) Vahdati, Fedwa Laamarti, and Abdulmotaleb El Saddik. 2025. Meta-Review of Wearable Devices for Healthcare in the Metaverse. ACM Transactions on Multimedia Computing, Communications, and Applications21 (7 2025), 1–36. Issue 7. doi:10.1145/3705320
-
[36]
Yasmin Elsaddik Valdivieso, Mohd Faisal, Karim Alghoul, Monireh Monica Vahdati, Kamran Gholizadeh Hamlabadi, Fedwa Laamarti, Hussein Al Osman, and Abdulmotaleb El Saddik. 2025. The Potential of Olfactory Stimuli in Stress Reduction Through Virtual Reality. In2025 IEEE Medical Measurements & Applications (MeMeA). IEEE, 1–6. doi:10.1109/MeMeA65319.2025.11068102
-
[37]
Jingyi Xue, Jinqin Wang, Shiang Hu, Ning Bi, and Zhao Lv. 2022. OVPD: Odor-Video Elicited Physiological Signal Database for Emotion Recognition. IEEE Transactions on Instrumentation and Measurement71 (2022). doi:10.1109/TIM.2022.3149116
-
[38]
Guanghao Yin, Shouqian Sun, Dian Yu, Dejian Li, and Kejun Zhang. 2022. A Multimodal Framework for Large-Scale Emotion Recognition by Fusing Music and Electrodermal Activity Signals.ACM Transactions on Multimedia Computing, Communications, and Applications18 (8 2022), 1–23. Issue 3. doi:10.1145/3490686
-
[39]
Xinjie Zhang, Tenggan Zhang, Lei Sun, Jinming Zhao, and Qin Jin. 2025. Exploring Interpretability in Deep Learning for Affective Computing: A Comprehensive Review.ACM Transactions on Multimedia Computing, Communications, and Applications21 (7 2025), 1–28. Issue 7. doi:10.1145/3723005
-
[40]
Sicheng Zhao, Amir Gholaminejad, Guiguang Ding, Yue Gao, Jungong Han, and Kurt Keutzer. 2019. Personalized Emotion Recognition by Personality-Aware High-Order Learning of Physiological Signals.ACM Transactions on Multimedia Computing, Communications, and Applications 15 (1 2019), 1–18. Issue 1s. doi:10.1145/3233184
-
[41]
Sicheng Zhao, Shangfei Wang, Mohammad Soleymani, Dhiraj Joshi, and Qiang Ji. 2019. Affective Computing for Large-scale Heterogeneous Multimedia Data.ACM Transactions on Multimedia Computing, Communications, and Applications15 (11 2019), 1–32. Issue 3s. doi:10.1145/3363560
-
[42]
Junjie Zhu, Yuxuan Wei, Yifan Feng, Xibin Zhao, and Yue Gao. 2019. Physiological Signals-based Emotion Recognition via High-order Correlation Learning.ACM Transactions on Multimedia Computing, Communications, and Applications15 (11 2019), 1–18. Issue 3s. doi:10.1145/3332374 Received 20 February 2007; revised 12 March 2009; accepted 5 June 2009 Manuscript ...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.