pith. machine review for the scientific record.

arxiv: 2605.00973 · v1 · submitted 2026-05-01 · 💻 cs.LG · cs.AI · eess.SP

Recognition: unknown

Physiology-Aware Masked Cross-Modal Reconstruction for Biosignal Representation Learning

Cyrus Tanade, Eugene Hwang, Hao Zhou, Juhyeon Lee, Justin Sung, Keum San Chun, Li Zhu, Md Mahbubur Rahman, Megha Thukral, Mehrab Bin Morshed, Migyeong Gwak, Sharanya Arcot Desai, Simon A. Lee, Subramaniam Venkatraman, Viswam Nathan

Pith reviewed 2026-05-09 20:02 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · eess.SP
keywords biosignals · self-supervised learning · masked autoencoders · ECG · PPG · representation learning · pretraining · multimodal

The pith

Pretraining with masked cross-modal reconstruction between temporally ordered biosignals like ECG and PPG produces representations that outperform unimodal and multimodal baselines on 15 of 19 downstream tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Biosignals from different body sites often record sequential stages of the same physiological event, with ECG detecting electrical initiation of a heartbeat before PPG registers the resulting pulse wave. Most self-supervised methods ignore this ordering and treat the signals as interchangeable views. xMAE instead uses masked cross-modal reconstruction during pretraining to enforce directional timing structure in the learned embeddings. The resulting representations improve performance on cardiovascular outcome prediction, sleep staging, lab test anomaly detection, and demographic inference, and they transfer across devices, sensor placements, and recording conditions.

Core claim

xMAE is a biosignal pretraining framework that leverages masked cross-modal reconstruction across temporally ordered biosignals as a training-time constraint to encourage physiologically meaningful timing structure in the learned representations. Pretraining with xMAE yields representations that outperform both unimodal and multimodal baselines on 15 of 19 downstream tasks, including cardiovascular outcome prediction, abnormal laboratory test detection, sleep staging, and demographic inference, while generalizing across devices, body locations, and acquisition settings. Further analysis indicates that the ECG-PPG timing structure is reflected in the learned PPG representations.

What carries the argument

The masked cross-modal reconstruction objective that reconstructs one temporally delayed biosignal (such as PPG) from masked patches of an earlier signal (such as ECG) to embed directional physiological timing.
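
To make that objective concrete, here is a minimal sketch of what such a loss can look like, using the setup stated in the paper (paired 10-second ECG and PPG segments at 100 Hz, split into 40-sample patches, modality-specific Transformer encoding with learnable positional embeddings): ECG patches are partially masked and encoded, and the delayed PPG segment is regressed from the visible ECG tokens. Layer sizes, the fixed masking ratio, and the zeroing of masked tokens are illustrative choices, not the authors' implementation.

```python
import torch
import torch.nn as nn

PATCH, N_PATCH, D = 40, 25, 128  # 25 patches x 40 samples = one 10 s segment at 100 Hz


class CrossModalMAE(nn.Module):
    """Toy cross-modal MAE: regress the delayed PPG from partially masked ECG."""

    def __init__(self):
        super().__init__()
        self.ecg_embed = nn.Linear(PATCH, D)                 # ECG patch -> token
        self.pos = nn.Parameter(torch.zeros(1, N_PATCH, D))  # learnable positions
        layer = nn.TransformerEncoderLayer(d_model=D, nhead=4, batch_first=True)
        self.ecg_encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.ppg_head = nn.Linear(D, PATCH)                  # token -> PPG patch

    def forward(self, ecg, ppg, mask_ratio=0.5):
        b = ecg.shape[0]
        tokens = self.ecg_embed(ecg.view(b, N_PATCH, PATCH)) + self.pos
        # Mask patches of the *earlier* signal (ECG). The paper anneals this
        # ratio with a curriculum; here it is fixed for brevity, and masked
        # tokens are zeroed rather than dropped as in a full MAE.
        visible = (torch.rand(b, N_PATCH, device=ecg.device) > mask_ratio).float()
        z = self.ecg_encoder(tokens * visible.unsqueeze(-1))  # ECG-only encoding
        ppg_hat = self.ppg_head(z).reshape(b, -1)             # predict delayed PPG
        return nn.functional.mse_loss(ppg_hat, ppg)


# Synthetic paired segments stand in for real ECG/PPG pairs.
model = CrossModalMAE()
ecg, ppg = torch.randn(8, 1000), torch.randn(8, 1000)
loss = model(ecg, ppg)
loss.backward()
print(f"reconstruction loss: {loss.item():.4f}")
```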

If this is right

  • Representations transfer to 15 of 19 tasks spanning outcome prediction, anomaly detection, sleep staging, and demographics.
  • Performance gains hold when models are tested on new devices, sensor sites, and acquisition protocols.
  • Learned PPG embeddings encode measurable ECG-to-PPG timing offsets (a probe sketch follows this list).
  • The approach applies to any multimodal biosignals that observe successive stages of one underlying process.
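
The third point is directly testable with a frozen-encoder probe: if pretraining pushes timing information into the PPG embeddings, a simple regression should recover per-segment ECG-to-PPG delay better than a shuffled-label control. A minimal sketch with stand-in arrays; ppg_emb and delay_ms are hypothetical placeholders, not artifacts released with the paper.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
ppg_emb = rng.normal(size=(2000, 128))       # stand-in for frozen PPG embeddings
delay_ms = rng.uniform(150, 350, size=2000)  # stand-in for measured ECG-to-PPG delays

# Ridge probe on frozen embeddings, compared against a shuffled-label control.
scores = cross_val_score(Ridge(alpha=1.0), ppg_emb, delay_ms, cv=5, scoring="r2")
control = cross_val_score(Ridge(alpha=1.0), ppg_emb, rng.permutation(delay_ms),
                          cv=5, scoring="r2")
print(f"probe R^2 {scores.mean():.3f} vs shuffled-label control {control.mean():.3f}")
# With real embeddings, a probe R^2 clearly above the control would support the
# bullet; with the random stand-ins above both numbers sit near or below zero.
```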

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same ordering-aware reconstruction could be applied to other causally linked signal pairs such as respiratory effort before oxygen saturation changes.
  • Wearable systems might benefit from pretraining on paired ECG-PPG streams to improve real-time fusion without explicit alignment modules.
  • Similar constraints may help in other domains where one modality precedes another, such as audio preceding video in speech events.

Load-bearing premise

That the directional timing relationship between signals can be effectively enforced as a reconstruction constraint during pretraining and will produce representations that measurably improve downstream task performance.

What would settle it

A control model pretrained with standard masked reconstruction or contrastive objectives but without any cross-modal ordering constraint achieves equal or higher accuracy on the same 19 downstream tasks.
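
One way to run that control without touching masking or architecture is to keep the objective identical and break only the cross-modal pairing, for example by re-pairing each ECG segment with a PPG segment drawn from elsewhere in the batch during pretraining. A minimal sketch of the pretraining step under this assumption; the model and optimizer are whichever the main objective uses, and downstream linear-probe evaluation is elided.

```python
import torch


def pretrain_step(model, optimizer, ecg, ppg, break_ordering: bool):
    """One pretraining step; break_ordering=True is the proposed control arm."""
    if break_ordering:
        # Re-pair PPG with ECG from other samples in the batch, destroying the
        # directional timing relationship while keeping both marginals intact.
        ppg = ppg[torch.randperm(ppg.shape[0])]
    optimizer.zero_grad()
    loss = model(ecg, ppg)  # same masked cross-modal objective in both arms
    loss.backward()
    optimizer.step()
    return loss.item()

# Train both arms to convergence with identical hyperparameters, then freeze the
# encoders and fit the same linear probes on the 19 downstream tasks; parity
# between the arms would attribute the gains to generic cross-modal pretraining
# rather than to the timing structure.
```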

read the original abstract

Biosignals acquired from different locations on the body often provide temporally ordered views of the same underlying physiological process. However, most existing self-supervised learning methods treat these signals as interchangeable views, overlooking the directional temporal dynamics that link them. A canonical example is the relationship between electrocardiography (ECG), which captures the electrical activation initiating each heartbeat, and photoplethysmography (PPG), which records the resulting peripheral pulse delayed by vascular dynamics. To capture this structured relationship, we introduce xMAE, a biosignal pretraining framework that leverages masked cross-modal reconstruction across temporally ordered biosignals as a training-time constraint to encourage physiologically meaningful timing structure in the learned representations. We show that pretraining with xMAE yields representations that outperform both unimodal and multimodal baselines on 15 of 19 downstream tasks, including cardiovascular outcome prediction, abnormal laboratory test detection, sleep staging, and demographic inference, while generalizing across devices, body locations, and acquisition settings. Further analysis suggests that the ECG-PPG timing structure is reflected in the learned PPG representations. More broadly, xMAE demonstrates the effectiveness of incorporating temporal structure into multimodal pretraining when signals observe different stages of a shared underlying process. Code is available at https://github.com/hzhou3/xMAE.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper introduces xMAE, a self-supervised pretraining framework for biosignals that performs masked cross-modal reconstruction between temporally ordered signals (e.g., ECG preceding PPG due to vascular delay) to learn physiologically structured representations. It reports that this approach outperforms unimodal and multimodal baselines on 15 of 19 downstream tasks spanning cardiovascular outcome prediction, abnormal lab test detection, sleep staging, and demographic inference, with generalization across devices, body locations, and settings. Additional analysis indicates that the learned PPG representations reflect the ECG-PPG timing structure.

Significance. If the empirical results are robust and the directional timing mechanism is shown to be causal for the gains, this work would be significant for advancing multimodal self-supervised learning in biosignals by incorporating physiological priors rather than treating signals as interchangeable views. The release of code supports reproducibility. It could influence pretraining strategies for other temporally structured multimodal data in healthcare.

major comments (1)
  1. Experiments section: No ablation study isolates the effect of the directional temporal ordering (e.g., by randomizing PPG relative to ECG or using symmetric bidirectional reconstruction without delay modeling) while holding masking, architecture, and other factors fixed. This is load-bearing for the central claim, as the reported gains on 15 of 19 tasks (including cardiovascular, sleep, and lab tasks) could arise from generic cross-modal pretraining rather than the physiology-aware timing structure.
minor comments (1)
  1. Abstract: The claim of outperformance on 15 of 19 tasks is stated without specifying baseline definitions, the number of runs, or statistical tests; reporting these would strengthen the summary for readers (a sketch of one such paired comparison follows).
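
On the minor point, a paired non-parametric test over the 19 per-task metrics is a cheap way to summarize the 15-of-19 result, provided both models are evaluated on identical splits and averaged over seeds. A minimal sketch with placeholder scores; the arrays below are synthetic, not the paper's numbers.

```python
import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(1)
baseline = rng.uniform(0.60, 0.90, size=19)        # placeholder per-task metrics
xmae = baseline + rng.normal(0.01, 0.02, size=19)  # placeholder xMAE metrics

stat, p = wilcoxon(xmae, baseline, alternative="greater")
print(f"xMAE better on {int(np.sum(xmae > baseline))}/19 tasks, "
      f"one-sided Wilcoxon p = {p:.4f}")
```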

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the major comment below and will incorporate revisions to strengthen the evidence for our central claim.

read point-by-point responses
  1. Referee: Experiments section: No ablation study isolates the effect of the directional temporal ordering (e.g., by randomizing PPG relative to ECG or using symmetric bidirectional reconstruction without delay modeling) while holding masking, architecture, and other factors fixed. This is load-bearing for the central claim, as the reported gains on 15 of 19 tasks (including cardiovascular, sleep, and lab tasks) could arise from generic cross-modal pretraining rather than the physiology-aware timing structure.

    Authors: We agree that the manuscript lacks a dedicated ablation that isolates the directional temporal ordering while holding masking, architecture, and other factors fixed. Our current analysis shows that the learned PPG representations reflect the ECG-PPG timing structure, but this does not fully rule out that gains could arise from generic cross-modal pretraining. We will add the requested ablation (including randomized relative timing and symmetric bidirectional reconstruction) in the revised version to directly test causality of the physiology-aware timing mechanism. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical results on downstream tasks are independent of the pretraining objective definition

full rationale

The paper defines xMAE as a masked cross-modal reconstruction objective that incorporates an external physiological fact (temporal ordering between ECG and PPG due to vascular delay). It then reports measured performance gains on 15 of 19 downstream tasks. This chain does not reduce any claimed result to a fitted parameter renamed as prediction, a self-referential definition, or a load-bearing self-citation. The temporal constraint is imported from physiology rather than derived from the model, and the outperformance numbers are obtained via standard evaluation rather than forced by construction. The claims are therefore grounded in external benchmarks rather than in the model's own definitions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that biosignals from different body locations provide temporally ordered views of the same process and that masked cross-modal reconstruction can encode this structure into useful representations.

axioms (1)
  • domain assumption Biosignals acquired from different locations on the body often provide temporally ordered views of the same underlying physiological process.
    This premise is stated in the first sentence of the abstract and directly motivates the cross-modal timing constraint.

pith-pipeline@v0.9.0 · 5587 in / 1307 out tokens · 73204 ms · 2026-05-09T20:02:24.314794+00:00 · methodology

