pith. machine review for the scientific record.

arxiv: 2604.12671 · v2 · submitted 2026-04-14 · 🧬 q-bio.QM · eess.SP

Recognition: unknown

Differentiating Physical and Psychological Stress Using Wearable Physiological Signals and Salivary Cortisol

George G. Malliaras, Marco Vinicio Alban-Paccha, Nikoletta Athanassopoulou, Ozan Kaya

Pith reviewed 2026-05-10 13:56 UTC · model grok-4.3

keywords wearable sensors · salivary cortisol · stress classification · physiological signals · psychological stress · multimodal monitoring · machine learning · stress recovery

The pith

Wearable signals alone cannot reliably separate psychological stress from rest, but adding salivary cortisol lifts overall classification accuracy to 94.4 percent.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The study tests whether standard wearable measurements of heart rate, heart rate variability, skin conductance, and movement can distinguish physical stress, psychological stress, and their recovery phases from each other and from rest. In sessions with six adults, these signals identified physical stress and recovery well but often confused psychological stress with resting states. Adding salivary cortisol levels sampled at five points per session allowed a machine learning model to correctly label psychological stress and recovery far more often while cutting down on rest misclassifications. The combined approach reached 94.4 percent overall accuracy under leave-one-participant-out testing. This outcome indicates that physiological wearables may require direct endocrine context to monitor mental stress effectively.

Core claim

A gradient boosting classifier trained on 10-minute windows of heart rate, heart rate variability, electrodermal activity, and wrist accelerometry data achieved 77.8 percent accuracy when labeling rest, physical stress, physical recovery, psychological stress, or psychological recovery. Adding five cortisol features per window raised overall accuracy to 94.4 percent, with psychological stress recall rising from 50.0 percent to 83.3 percent and psychological recovery recall rising from 54.2 percent to 87.5 percent, while also lowering confusion between psychological stress and rest.

What carries the argument

Gradient boosting classifier operating on features extracted from wearable physiological signals and salivary cortisol in non-overlapping 10-minute windows, evaluated via leave-one-participant-out cross-validation.
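The evaluation scheme described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: it assumes scikit-learn's GradientBoostingClassifier with default hyperparameters and uses synthetic features. The function name lopo_accuracy, the feature count, and the number of windows per participant are hypothetical.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import LeaveOneGroupOut


def lopo_accuracy(X, y, groups, seed=0):
    """Return held-out accuracy per participant (leave-one-participant-out)."""
    scores = {}
    for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups):
        # A fresh model per fold, so no window from the held-out
        # participant is ever seen during training.
        model = GradientBoostingClassifier(random_state=seed)
        model.fit(X[train_idx], y[train_idx])
        held_out = int(groups[test_idx][0])
        scores[held_out] = float(model.score(X[test_idx], y[test_idx]))
    return scores


# Synthetic stand-in data (assumption): 6 participants, 15 windows each,
# 8 features per window, 5 class labels cycling 0..4.
rng = np.random.default_rng(0)
n_participants, windows_each, n_features, n_classes = 6, 15, 8, 5
groups = np.repeat(np.arange(n_participants), windows_each)
y = np.tile(np.arange(n_classes), n_participants * windows_each // n_classes)
X = rng.normal(size=(len(y), n_features)) + y[:, None]  # class-dependent shift

fold_scores = lopo_accuracy(X, y, groups)
for participant, acc in sorted(fold_scores.items()):
    print(f"held-out participant {participant}: accuracy {acc:.2f}")
```

Grouping the split by participant rather than by window is the design choice that matters here: windows from the same person are correlated, so a random window-level split would overstate generalisation to new people.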

If this is right

  • Wearable signals alone perform well for physical stress and recovery but frequently misclassify psychological stress as rest.
  • Including cortisol data improves recall for both psychological stress and psychological recovery states.
  • Cortisol integration reduces overlap between psychological stress and rest labels.
  • Multimodal sensing supports more reliable stress-type monitoring than physiology alone.
  • The results motivate larger studies and development of practical alternatives to repeated cortisol sampling.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Non-invasive proxies for cortisol could extend wearable mental-stress tracking into continuous daily use.
  • Better separation of stress types may improve real-time guidance in fitness and mental-health applications.
  • Field tests outside the lab would reveal whether the observed accuracy gains hold when stress arises naturally.
  • Device designs might combine standard sensors with other biomarkers to achieve similar classification without saliva collection.

Load-bearing premise

The laboratory protocols of high-intensity cycling and a modified social stress test induce stress responses representative of everyday conditions that can be accurately labeled and generalized from only six participants.

What would settle it

A follow-up study with more participants performing real-world activities, with stress types independently confirmed by additional measures, that checks whether the accuracy gain from adding cortisol persists or falls substantially.

Figures

Figures reproduced from arXiv: 2604.12671 by George G. Malliaras, Marco Vinicio Alban-Paccha, Nikoletta Athanassopoulou, Ozan Kaya.

Figure 1: Study instrumentation and experimental protocol. (A) Wearable and biospecimen measures, including chest electrocardiography (ECG), wrist-based heart rate, wrist electrodermal activity (EDA) and accelerometry, and salivary cortisol sampling. (B) Protocol timeline for the three experimental sessions: rest (baseline), physical stress (high-intensity cycling), and psychological stress (modified Trier Social St…
Figure 2: Self-reported stress ratings during the TSST and baseline session for each of the six participants.
Figure 3: Temporal profiles of physiological signals and salivary cortisol during rest, physical stress, and psychological stress. From top to bottom: heart rate (HR), HRV (RMSSD), EDA tonic skin conductance, and salivary cortisol (baseline-corrected). Time 0 marks task onset.
Figure 4: Confusion matrices for five-class stress classification (A) using wearable physiological features alone and (B) after inclusion of salivary cortisol features. Values are row-normalised proportions (each row sums to 1.0). The colour scale ranges from 0.0 to 1.0, with darker shading indicating higher proportions.
Original abstract

Objective: This study aimed to assess how wearable physiological signals, alone and combined with salivary cortisol, distinguish physical and psychological stress and their recovery states. Methods: Six healthy adults completed three laboratory sessions on separate days: rest, physical stress (high-intensity cycling), or psychological stress (modified Trier Social Stress Test). Heart rate, heart rate variability, electrodermal activity, and wrist accelerometry were recorded continuously, and salivary cortisol was sampled at five time points. Features were extracted in non-overlapping 10-minute windows and labelled as rest, physical stress, physical recovery, psychological stress, or psychological recovery. A gradient boosting classifier was trained using wearable features alone and with five additional cortisol features per window. Performance was evaluated using leave-one-participant-out cross-validation. Results: Wearable-only classification achieved 77.8% overall accuracy, with high accuracy for physical stress and recovery but frequent misclassification of psychological stress and recovery (recall 50.0% and 54.2%). Including cortisol improved overall accuracy (94.4%), particularly for psychological states, increasing recall to 83.3% and 87.5%. Cortisol also reduced misclassification between psychological stress and rest. Conclusion: Wearable signals alone were insufficient to reliably distinguish psychological stress from rest and recovery. Integrating salivary cortisol improved classification of psychological stress and recovery and reduced confusion with rest, highlighting the value of endocrine context alongside wearable physiology. Significance: These findings support multimodal stress monitoring and motivate larger, ecologically valid studies and scalable alternatives to repeated cortisol sampling.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper reports a laboratory study with n=6 healthy adults who completed separate sessions of rest, high-intensity cycling (physical stress), and modified Trier Social Stress Test (psychological stress). Wearable signals (HR, HRV, EDA, accelerometry) were recorded continuously and salivary cortisol sampled at five time points; features were extracted in non-overlapping 10-min windows and labeled into five classes. A gradient-boosting classifier achieved 77.8% overall accuracy using wearable features alone (high accuracy on physical states but only 50.0% and 54.2% recall on psychological stress and recovery). Adding five cortisol-derived features per window raised overall accuracy to 94.4%, with psychological-state recall improving to 83.3% and 87.5% and reduced confusion between psychological stress and rest. The authors conclude that wearable signals alone are insufficient for reliable psychological-stress differentiation and that endocrine context adds value.

Significance. If the reported accuracy lift is statistically reliable and generalizable, the work provides concrete evidence that salivary cortisol supplies information orthogonal to standard wearable physiology for distinguishing psychological stress/recovery from rest. This supports the broader case for multimodal stress monitoring and supplies a clear quantitative motivation for developing scalable cortisol proxies or larger ambulatory studies.

major comments (3)
  1. [Results] Results: The 16.6-point accuracy gain (77.8% → 94.4%) and the recall improvements for psychological states are reported as single point estimates with no accompanying standard deviation across LOOCV folds, confidence intervals, or paired statistical test comparing the two feature sets on identical folds. With n=6 this omission leaves the central quantitative claim without statistical grounding.
  2. [Methods] Methods: Leave-one-participant-out cross-validation on a cohort of only six participants yields an effective sample size of six for estimating inter-individual variability; the manuscript provides no analysis of fold-wise variance, participant-specific performance, or discussion of whether the observed lift could arise from session-order or label-noise effects.
  3. [Methods] Methods: The exact definition and temporal alignment of the five cortisol features per 10-min window, as well as the rule used to assign recovery labels to windows following the cycling and mTSST protocols, are not described in sufficient detail to allow independent reproduction or assessment of label noise.
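Major comment 1 asks for fold-level spread, confidence intervals, and a paired test. A rough sketch of that reporting is given below, assuming SciPy's wilcoxon and a percentile bootstrap over the six paired fold differences. The per-fold accuracies are hypothetical illustrative numbers, not values from the paper, and summarise_paired_folds is a made-up helper name.

```python
import numpy as np
from scipy.stats import wilcoxon

# Hypothetical per-fold accuracies, one per held-out participant.
# Illustrative only: these are NOT results reported in the paper.
acc_wearable = np.array([0.70, 0.80, 0.75, 0.85, 0.72, 0.84])
acc_cortisol = np.array([0.90, 0.95, 1.00, 0.93, 0.96, 0.94])


def summarise_paired_folds(a, b, n_boot=10_000, seed=0):
    """Mean/SD per model, paired Wilcoxon test, bootstrap CI of the mean gain."""
    diff = b - a
    result = wilcoxon(b, a)  # two-sided signed-rank test on paired folds
    rng = np.random.default_rng(seed)
    # Resample the paired differences with replacement (percentile bootstrap).
    idx = rng.integers(0, len(diff), size=(n_boot, len(diff)))
    boot_means = diff[idx].mean(axis=1)
    ci_low, ci_high = np.percentile(boot_means, [2.5, 97.5])
    return {
        "mean_a": a.mean(), "sd_a": a.std(ddof=1),
        "mean_b": b.mean(), "sd_b": b.std(ddof=1),
        "mean_diff": diff.mean(),
        "p_value": float(result.pvalue),
        "ci_low": float(ci_low), "ci_high": float(ci_high),
    }


summary = summarise_paired_folds(acc_wearable, acc_cortisol)
for key, value in summary.items():
    print(f"{key}: {value:.4f}")
```

With only six folds, the smallest two-sided p-value an exact signed-rank test can produce is 2/2^6 = 0.03125, which is what this example yields because every fold improves and no differences are tied. Any such test on n=6 therefore has very coarse resolution, which is worth stating alongside the result.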
minor comments (2)
  1. [Abstract] The abstract contains a separate “Significance” paragraph; this is unconventional and could be folded into the Conclusion for clarity.
  2. [Results] The total number of 10-min windows per class and the class-balance statistics are not reported, making it difficult to interpret the absolute meaning of the 50.0% and 54.2% recall figures.
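Minor comment 2 turns on the fact that a row-normalised confusion matrix hides how many windows each class contains. The sketch below reports per-class support next to recall. The count matrix is invented for illustration and does not come from the paper; recall_with_support is a hypothetical helper.

```python
import numpy as np

LABELS = [
    "rest", "physical stress", "physical recovery",
    "psychological stress", "psychological recovery",
]

# Hypothetical window counts (rows = true class, columns = predicted class).
# Made-up numbers for illustration, NOT the paper's confusion matrix.
counts = np.array([
    [8, 0, 0, 1, 1],
    [0, 6, 0, 0, 0],
    [0, 0, 10, 0, 0],
    [2, 0, 0, 3, 1],
    [3, 0, 0, 1, 6],
])


def recall_with_support(counts):
    """Return row-normalised matrix, per-class recall, support, and accuracy."""
    counts = np.asarray(counts, dtype=float)
    support = counts.sum(axis=1)              # true windows per class
    row_normalised = counts / support[:, None]
    recall = np.diag(row_normalised)          # correct / true, per class
    accuracy = np.trace(counts) / counts.sum()
    return row_normalised, recall, support, accuracy


row_normalised, recall, support, accuracy = recall_with_support(counts)
for name, r, n in zip(LABELS, recall, support):
    print(f"{name:24s} recall={r:.3f} support={int(n)}")
print(f"overall accuracy={accuracy:.3f}")
```

In this made-up example the psychological stress class has six windows, so a single reclassified window moves its recall by about 16.7 percentage points. That sensitivity is why the class counts are needed to interpret the reported recall figures.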

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive comments, which highlight important issues of statistical rigor and reproducibility in our small-cohort study. We agree that additional reporting is needed and will revise the manuscript to address each point.

  1. Referee: [Results] Results: The 16.6-point accuracy gain (77.8% → 94.4%) and the recall improvements for psychological states are reported as single point estimates with no accompanying standard deviation across LOOCV folds, confidence intervals, or paired statistical test comparing the two feature sets on identical folds. With n=6 this omission leaves the central quantitative claim without statistical grounding.

    Authors: We agree that point estimates alone are insufficient. In the revised Results section we will report the mean and standard deviation of accuracy (and per-class recall) across the six LOOCV folds for both models. We will also add bootstrap-derived 95% confidence intervals and a paired Wilcoxon signed-rank test on the per-fold accuracies to assess the significance of the improvement. We will explicitly note the limited statistical power inherent to n=6 as a limitation. revision: yes

  2. Referee: [Methods] Methods: Leave-one-participant-out cross-validation on a cohort of only six participants yields an effective sample size of six for estimating inter-individual variability; the manuscript provides no analysis of fold-wise variance, participant-specific performance, or discussion of whether the observed lift could arise from session-order or label-noise effects.

    Authors: We will add a supplementary table (or expanded Results paragraph) listing accuracy and recall for each of the six participants/folds under both feature sets, thereby exposing fold-wise variance and inter-individual differences. We will also insert a dedicated Limitations paragraph that discusses possible session-order effects (sessions occurred on separate days) and any detectable label-assignment variability, while acknowledging that the small sample precludes definitive exclusion of such confounds. revision: yes

  3. Referee: [Methods] Methods: The exact definition and temporal alignment of the five cortisol features per 10-min window, as well as the rule used to assign recovery labels to windows following the cycling and mTSST protocols, are not described in sufficient detail to allow independent reproduction or assessment of label noise.

    Authors: We apologize for the lack of detail. The revised Methods section will specify the exact computation of each of the five cortisol features, the interpolation method used to align the five salivary samples with the 10-min windows, and the precise temporal criteria for labeling recovery windows after each protocol (cycling and mTSST). These additions will permit reproduction and evaluation of potential label noise. revision: yes
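Response 3 promises to specify how five sparse saliva samples are aligned with 10-minute windows. One possible scheme is sketched below, assuming linear interpolation to each window's centre with NumPy's interp. The paper does not state that this is the method used. The sample times, concentrations, and the function name cortisol_at_window_centres are all hypothetical.

```python
import numpy as np


def cortisol_at_window_centres(sample_times_min, cortisol, window_starts_min,
                               window_len_min=10.0):
    """Linearly interpolate sparse cortisol samples to window centres."""
    centres = np.asarray(window_starts_min, dtype=float) + window_len_min / 2.0
    # np.interp needs increasing x; outside the sampled range it holds
    # the first or last value constant rather than extrapolating.
    return np.interp(centres, sample_times_min, cortisol)


# Hypothetical session: five saliva samples (minutes relative to task onset)
# with made-up concentrations in nmol/L.
sample_times = np.array([-10.0, 0.0, 10.0, 20.0, 40.0])
cortisol_nmol_l = np.array([5.0, 5.5, 9.0, 12.0, 7.0])
window_starts = np.array([0.0, 10.0, 20.0, 30.0])

aligned = cortisol_at_window_centres(sample_times, cortisol_nmol_l, window_starts)
print(aligned)
```

Whatever rule is chosen, it matters for the leakage question raised by the referee: an interpolated value in one window depends on samples taken later in the session, so the exact alignment rule determines what information each window's cortisol features carry.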

Circularity Check

0 steps flagged

No circularity: empirical ML classification on experimental data


The paper is a supervised machine-learning study that collects physiological and cortisol data from six participants across controlled lab sessions, extracts features from 10-minute windows, trains a gradient boosting classifier, and reports cross-validated accuracies. No derivations, first-principles predictions, or equations are present. Performance numbers (77.8% wearable-only, 94.4% with cortisol) are direct empirical outputs of leave-one-participant-out cross-validation on labeled data; they do not reduce to any fitted parameter or self-defined quantity by construction. No self-citations support uniqueness theorems, ansatzes, or load-bearing premises. The work is self-contained against external benchmarks (the collected dataset and standard ML pipeline) and contains no renaming of known results or smuggling of assumptions via citation.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 0 invented entities

The claim rests on standard machine-learning assumptions plus the domain premise that lab-induced stresses produce distinct, labelable physiological and endocrine signatures; no new entities are postulated and only one explicit free parameter (window duration) is visible.

free parameters (1)
  • analysis window size = 10 minutes
    10-minute non-overlapping windows chosen for feature extraction; directly affects which temporal dynamics are captured and is not derived from data.
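To make the role of the window length concrete, here is a minimal sketch of non-overlapping windowing applied to RR intervals, computing mean heart rate and RMSSD per window. It assumes clean RR intervals in milliseconds and drops any incomplete final window. The paper's actual preprocessing and feature set are not described here, so window_hrv_features and its behaviour are assumptions.

```python
import numpy as np


def window_hrv_features(rr_ms, window_s=600.0):
    """Mean HR and RMSSD per complete, non-overlapping window of RR intervals."""
    rr_ms = np.asarray(rr_ms, dtype=float)
    beat_end_s = np.cumsum(rr_ms) / 1000.0        # end time of each interval
    window_id = (beat_end_s // window_s).astype(int)
    n_complete = int(beat_end_s[-1] // window_s)  # drop the trailing partial window
    features = []
    for w in range(n_complete):
        rr = rr_ms[window_id == w]
        if len(rr) < 2:
            continue  # RMSSD needs at least one successive difference
        rmssd = float(np.sqrt(np.mean(np.diff(rr) ** 2)))
        features.append({
            "window": w,
            "mean_hr_bpm": float(60000.0 / rr.mean()),
            "rmssd_ms": rmssd,
        })
    return features


# Synthetic RR series (assumption): alternating 800 ms and 1000 ms beats,
# 1400 beats = 1260 s, i.e. two complete 10-minute windows plus 60 s.
rr_series = np.tile([800.0, 1000.0], 700)
for row in window_hrv_features(rr_series):
    print(row)
```

Changing window_s changes both how many labelled examples each session yields and which dynamics a feature such as RMSSD can reflect, which is why the ledger treats the window length as a free parameter.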
axioms (2)
  • domain assumption: Laboratory stress induction protocols produce distinct and representative physical versus psychological stress responses suitable for supervised labeling.
    This premise enables the five-class labeling used to train and evaluate the classifier.
  • domain assumption: Extracted features from heart rate, heart-rate variability, electrodermal activity, and accelerometry plus cortisol samples are sufficient to discriminate the target states.
    Underlies the feature-based classification approach without further justification in the abstract.

pith-pipeline@v0.9.0 · 5598 in / 1642 out tokens · 74559 ms · 2026-05-10T13:56:26.938454+00:00 · methodology

