FEEL: Quantifying Heterogeneity in Physiological Signals for Generalizable Emotion Recognition
Pith reviewed 2026-05-10 18:20 UTC · model grok-4.3
The pith
Evaluating 16 model types on 19 datasets shows that contrastive pretraining plus handcrafted features handles variation in devices and settings best for physiological emotion recognition.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Through the FEEL benchmark we show that fine-tuned contrastive signal-language pretraining (CLSP) models achieve the highest F1 scores across arousal and valence tasks in 71 of 114 evaluations, while simpler models such as Random Forests, LDA, and MLP stay competitive in 36, and handcrafted-feature models outperform raw-segment models in 107 of 114 cases. Cross-dataset tests further establish that models trained on real-life recordings transfer to laboratory and constraint-based data with F1 scores of 0.79 and 0.78, expert-labeled training transfers to stimulus-labeled and self-reported data with F1 scores of 0.72 and 0.76, and lab-device models transfer to custom wearables (F1 = 0.81) and the Empatica E4 (F1 = 0.73).
What carries the argument
The FEEL cross-dataset evaluation protocol that measures within- and between-dataset performance of 16 architectures on 19 public EDA and PPG datasets for arousal and valence classification.
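The protocol can be sketched as a train-on-one, test-on-the-rest loop scored with macro-F1. This is a minimal stand-in, not the paper's code: the nearest-centroid classifier and the data layout here are assumptions for illustration only.

```python
# Hypothetical sketch of a FEEL-style cross-dataset protocol: fit on each
# source dataset, score on every other target with macro-F1. The classifier
# is a toy nearest-centroid model, not one of the 16 benchmarked architectures.
from collections import defaultdict

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for cls in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
        fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        scores.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(scores) / len(scores)

def nearest_centroid_fit(X, y):
    """Per-class mean feature vector."""
    sums = defaultdict(lambda: [0.0] * len(X[0]))
    counts = defaultdict(int)
    for x, label in zip(X, y):
        counts[label] += 1
        for i, v in enumerate(x):
            sums[label][i] += v
    return {c: [s / counts[c] for s in sums[c]] for c in sums}

def nearest_centroid_predict(centroids, X):
    def dist2(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))
    return [min(centroids, key=lambda c: dist2(x, centroids[c])) for x in X]

def cross_dataset_eval(datasets):
    """datasets: {name: (X, y)}; returns {(source, target): macro-F1}."""
    results = {}
    for src, (Xs, ys) in datasets.items():
        model = nearest_centroid_fit(Xs, ys)
        for tgt, (Xt, yt) in datasets.items():
            if tgt == src:
                continue
            results[(src, tgt)] = macro_f1(yt, nearest_centroid_predict(model, Xt))
    return results
```

Within-dataset scores would come from the same loop with held-out splits of a single dataset instead of a second dataset.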
If this is right
- Training on real-life setting data produces models that reach F1 of 0.79 on lab settings and 0.78 on constraint-based settings.
- Expert-annotated training data transfers to stimulus-labeled datasets at F1 0.72 and to self-reported datasets at F1 0.76.
- Models trained on laboratory devices transfer to custom wearable devices at F1 0.81 and to the Empatica E4 at F1 0.73.
- Handcrafted features remain essential for performance in low-resource, noisy physiological signal environments.
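The kind of handcrafted, physiologically motivated features the benchmark favors can be illustrated in a few lines. These particular features, thresholds, and names are assumptions for the sketch, not the paper's exact feature set:

```python
# Illustrative EDA/PPG features; the 0.05 prominence threshold and the
# feature names are made up for this sketch.
def count_peaks(signal, min_prominence):
    """Count local maxima rising at least `min_prominence` above a neighbor."""
    peaks = 0
    for i in range(1, len(signal) - 1):
        if (signal[i] > signal[i - 1] and signal[i] >= signal[i + 1]
                and signal[i] - min(signal[i - 1], signal[i + 1]) >= min_prominence):
            peaks += 1
    return peaks

def eda_features(eda, fs):
    """Tonic level plus a crude skin-conductance-response (SCR) rate."""
    mean_level = sum(eda) / len(eda)
    scr_per_min = count_peaks(eda, 0.05) / (len(eda) / fs / 60.0)
    return {"eda_mean": mean_level, "scr_per_min": scr_per_min}

def ppg_heart_rate(ppg, fs):
    """Mean heart rate in BPM from inter-beat intervals between pulse peaks."""
    idx = [i for i in range(1, len(ppg) - 1)
           if ppg[i] > ppg[i - 1] and ppg[i] >= ppg[i + 1]]
    if len(idx) < 2:
        return 0.0
    ibis = [(b - a) / fs for a, b in zip(idx, idx[1:])]  # seconds per beat
    return 60.0 / (sum(ibis) / len(ibis))
```

Features like these feed the Random Forest / LDA / MLP baselines that the raw-segment deep models struggled to beat.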
Where Pith is reading between the lines
- Developers of consumer wearable apps could reduce retraining costs by adopting the handcrafted-feature pipelines and contrastive pretraining shown to transfer across device types.
- Future studies should include streaming or continuous-recording test sets to check whether the observed cross-dataset robustness holds in real-time deployment.
- The transfer patterns suggest that standardizing a small set of physiologically motivated features may be more practical than collecting ever-larger labeled datasets for every new sensor.
Load-bearing premise
The 19 selected public datasets together capture enough of the real variation in devices, settings, and labeling practices to support conclusions about generalization to new conditions.
What would settle it
Train the reported top models on the existing datasets and test them on a newly collected EDA/PPG dataset recorded with an unseen wearable in everyday conditions using only self-reported labels; the claim fails if F1 on both arousal and valence drops below the within-dataset levels shown in the paper.
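The settling test above reduces to a small decision rule; the dictionary layout here is just one way to encode it, not anything from the paper:

```python
# Falsification rule from the text: the claim fails only if F1 on the new
# dataset drops below within-dataset levels on BOTH arousal and valence.
def claim_fails(within_f1, new_f1):
    """within_f1/new_f1: dicts mapping 'arousal' and 'valence' to F1 scores."""
    return all(new_f1[task] < within_f1[task] for task in ("arousal", "valence"))
```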
Original abstract
Emotion recognition from physiological signals has substantial potential for applications in mental health and emotion-aware systems. However, the lack of standardized, large-scale evaluations across heterogeneous datasets limits progress and model generalization. We introduce FEEL, the first large-scale benchmarking study of emotion recognition using electrodermal activity (EDA) and photoplethysmography (PPG) signals across 19 publicly available datasets. We evaluate 16 architectures spanning traditional machine learning, deep learning, and self-supervised pretraining approaches, structured into four representative modeling paradigms. Our study includes both within-dataset and cross-dataset evaluations, analyzing generalization across variations in experimental settings, device types, and labeling strategies. Our results showed that fine-tuned contrastive signal-language pretraining (CLSP) models (71/114) achieve the highest F1 across arousal and valence classification tasks, while simpler models like Random Forests, LDA, and MLP remain competitive (36/114). Models leveraging handcrafted features (107/114) consistently outperform those trained on raw signal segments, underscoring the value of domain knowledge in low-resource, noisy settings. Further cross-dataset analyses reveal that models trained on real-life setting data generalize well to lab (F1 = 0.79) and constraint-based settings (F1 = 0.78). Similarly, models trained on expert-annotated data transfer effectively to stimulus-labeled (F1 = 0.72) and self-reported datasets (F1 = 0.76). Moreover, models trained on lab-based devices also demonstrated high transferability to both custom wearable devices (F1 = 0.81) and the Empatica E4 (F1 = 0.73), underscoring the influence of heterogeneity. More information about FEEL can be found on our website https://alchemy18.github.io/FEEL_Benchmark/.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces FEEL, the first large-scale benchmarking study of emotion recognition from EDA and PPG signals across 19 public datasets. It evaluates 16 architectures spanning traditional ML, deep learning, and self-supervised pretraining in four paradigms, reporting within- and cross-dataset results. Key findings include fine-tuned CLSP models achieving highest F1 in 71/114 cases, handcrafted-feature models outperforming raw-signal models in 107/114 cases, and cross-dataset transfer F1 scores of 0.79 (real-life to lab), 0.78 (to constraint-based), 0.72-0.76 (annotation types), and 0.73-0.81 (device types).
Significance. If statistically supported, this provides a valuable reference benchmark quantifying heterogeneity and generalization in physiological emotion recognition, with practical implications for mental health and affective computing applications. The scale (19 datasets, multiple paradigms) and public website are strengths that could guide future work on robust models.
Major comments (2)
- [Results] Results section: The central claims of CLSP superiority (71/114 wins) and handcrafted-feature outperformance (107/114) rest on aggregated win counts of F1 scores without reported confidence intervals, paired statistical tests (e.g., McNemar or Wilcoxon), or weighting by dataset size/class balance. Given the heterogeneity in the 19 datasets, these aggregates may not reliably establish consistent superiority.
- [Cross-dataset evaluation] Cross-dataset transfer subsection: The reported transfer F1 scores (e.g., 0.79 real-life to lab, 0.81 lab-device to custom wearable) are single point estimates with no per-pair variances, standard errors, or significance tests against within-dataset baselines. This is load-bearing for the generalization claims across experimental settings, labeling strategies, and device types.
Minor comments (2)
- [Abstract] Abstract: The parenthetical counts (71/114, 107/114) would benefit from a brief parenthetical explanation of what the denominator represents (e.g., total dataset-task combinations) to improve immediate readability.
- [Methods] The manuscript could add a summary table listing the 19 datasets with key metadata (size, setting, device, labeling method) to support the heterogeneity analysis.
Simulated Author's Rebuttal
We thank the referee for their positive assessment of the paper's significance and for the constructive major comments. We address each point below, agreeing that additional statistical rigor can strengthen the presentation of our results.
Point-by-point responses
-
Referee: [Results] Results section: The central claims of CLSP superiority (71/114 wins) and handcrafted-feature outperformance (107/114) rest on aggregated win counts of F1 scores without reported confidence intervals, paired statistical tests (e.g., McNemar or Wilcoxon), or weighting by dataset size/class balance. Given the heterogeneity in the 19 datasets, these aggregates may not reliably establish consistent superiority.
Authors: We appreciate this observation. The win counts serve as an intuitive aggregate metric to summarize performance across 114 classification tasks (arousal and valence from 19 datasets, with multiple models). Given the substantial heterogeneity in dataset sizes, experimental conditions, and class balances, we chose not to weight by size to avoid biasing toward larger datasets. However, we agree that confidence intervals and statistical tests would enhance the claims. In the revision, we will add bootstrap-derived 95% confidence intervals for the win proportions and perform non-parametric paired tests (e.g., Wilcoxon signed-rank on per-task F1 differences) to assess if the observed superiorities are statistically significant. This addresses the concern about reliability amid heterogeneity. revision: yes
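The promised bootstrap interval is straightforward to sketch. This is a hedged illustration of the proposed analysis, not the authors' implementation; the resample count and seed are arbitrary choices.

```python
# Minimal percentile-bootstrap 95% CI for a win proportion such as 71/114.
import random

def bootstrap_win_ci(wins, total, n_boot=10000, alpha=0.05, seed=0):
    rng = random.Random(seed)
    outcomes = [1] * wins + [0] * (total - wins)
    props = sorted(
        sum(rng.choice(outcomes) for _ in range(total)) / total
        for _ in range(n_boot)
    )
    lo = props[int(alpha / 2 * n_boot)]
    hi = props[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

low, high = bootstrap_win_ci(71, 114)
```

The Wilcoxon signed-rank test the authors mention would additionally need the per-task F1 differences, not just the win tally.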
-
Referee: [Cross-dataset evaluation] Cross-dataset transfer subsection: The reported transfer F1 scores (e.g., 0.79 real-life to lab, 0.81 lab-device to custom wearable) are single point estimates with no per-pair variances, standard errors, or significance tests against within-dataset baselines. This is load-bearing for the generalization claims across experimental settings, labeling strategies, and device types.
Authors: Thank you for highlighting this. The reported F1 values are averages over the multiple cross-dataset transfer pairs within each category (e.g., all real-life to lab transfers). To provide more transparency, we will include the standard deviation across these pairs and report the number of pairs for each aggregate in the revised manuscript. Additionally, we will conduct significance tests comparing the cross-dataset F1 to the corresponding within-dataset baselines using appropriate paired tests. This will better substantiate the generalization findings. revision: yes
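The promised per-category reporting (pair count, mean, standard deviation, and drop versus within-dataset baselines) could look like the following sketch. The F1 values in the example are made-up placeholders, not numbers from the paper:

```python
# Summarize a transfer category: pairs maps (source, target) to the
# cross-dataset F1 and the within-dataset F1 on the same target.
import math

def summarize(pairs):
    cross = [c for c, _ in pairs.values()]
    drops = [within - c for c, within in pairs.values()]
    n = len(cross)
    mean = sum(cross) / n
    std = math.sqrt(sum((c - mean) ** 2 for c in cross) / (n - 1)) if n > 1 else 0.0
    return {"n_pairs": n, "mean_f1": mean, "std_f1": std,
            "mean_drop": sum(drops) / n}

example = {("R1", "L1"): (0.80, 0.84),  # (cross F1, within F1) - placeholders
           ("R1", "L2"): (0.78, 0.83),
           ("R2", "L1"): (0.79, 0.84)}
stats = summarize(example)
```

A paired test against the within-dataset baselines would then operate on the per-pair drops.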
Circularity Check
No circularity: purely empirical benchmarking with no derivations or self-referential reductions
Full rationale
The paper conducts a large-scale empirical evaluation of 16 model architectures (traditional ML, deep learning, self-supervised) on 19 public EDA/PPG datasets for arousal/valence classification. All reported results consist of direct F1 scores from within- and cross-dataset training/evaluation runs; no equations, first-principles derivations, fitted parameters presented as predictions, or uniqueness theorems appear. Aggregated counts (71/114, 107/114) and transfer F1 values are simple tallies of experimental outcomes, not reductions to inputs by construction. No self-citations are invoked to justify load-bearing premises, and the study is self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
Not applicable: the study is purely empirical benchmarking, with no axioms or fitted free parameters to ledger.
Forward citations
Cited by 1 Pith paper
-
Impact of Validation Strategy on Machine Learning Performance in EEG-Based Alcoholism Classification
Nested cross-validation reveals optimistic bias in standard validation for EEG alcoholism classification, with AdaBoost reaching 78.3% accuracy and most model differences not statistically significant per McNemar's test.