Physiology-Aware Masked Cross-Modal Reconstruction for Biosignal Representation Learning
Pith reviewed 2026-05-09 20:02 UTC · model grok-4.3
The pith
Pretraining with masked cross-modal reconstruction between temporally ordered biosignals like ECG and PPG produces representations that outperform unimodal and multimodal baselines on 15 of 19 downstream tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
xMAE is a biosignal pretraining framework that leverages masked cross-modal reconstruction across temporally ordered biosignals as a training-time constraint to encourage physiologically meaningful timing structure in the learned representations. Pretraining with xMAE yields representations that outperform both unimodal and multimodal baselines on 15 of 19 downstream tasks, including cardiovascular outcome prediction, abnormal laboratory test detection, sleep staging, and demographic inference, while generalizing across devices, body locations, and acquisition settings. Further analysis indicates that the ECG-PPG timing structure is reflected in the learned PPG representations.
What carries the argument
The masked cross-modal reconstruction objective that reconstructs one temporally delayed biosignal (such as PPG) from masked patches of an earlier signal (such as ECG) to embed directional physiological timing.
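The masking step behind such an objective can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes 1000-sample segments split into 25 tokens of length 40 (figures quoted from the paper's appendix), while the 0.6 mask ratio, the function names, and the all-zeros stand-in "model" are hypothetical. Which modality is masked and which is reconstructed follows the paper, not this sketch.

```python
import random


def patchify(signal, patch_len):
    """Split a 1-D signal into consecutive non-overlapping patches."""
    if len(signal) % patch_len != 0:
        raise ValueError("signal length must be a multiple of patch_len")
    return [signal[i:i + patch_len] for i in range(0, len(signal), patch_len)]


def split_visible_masked(num_patches, mask_ratio, rng):
    """Pick which patch indices are hidden; return (visible, masked)."""
    num_masked = round(num_patches * mask_ratio)
    masked = sorted(rng.sample(range(num_patches), num_masked))
    hidden = set(masked)
    visible = [i for i in range(num_patches) if i not in hidden]
    return visible, masked


def masked_mse(target_patches, predicted_patches, masked):
    """Mean squared error restricted to the masked patches."""
    if not masked:
        raise ValueError("at least one patch must be masked")
    squared = [
        (t - p) ** 2
        for i in masked
        for t, p in zip(target_patches[i], predicted_patches[i])
    ]
    return sum(squared) / len(squared)


# Hypothetical 10 s segment at 100 Hz (1000 samples), filled with noise.
rng = random.Random(0)
ecg = [rng.gauss(0.0, 1.0) for _ in range(1000)]
ecg_patches = patchify(ecg, patch_len=40)  # 25 patches of 40 samples
visible, masked = split_visible_masked(len(ecg_patches), 0.6, rng)

# Stand-in "model": predicts zeros for every patch. A real model would
# predict the masked patches from the other modality and the visible ones.
prediction = [[0.0] * 40 for _ in ecg_patches]
loss = masked_mse(ecg_patches, prediction, masked)
```

Scoring only the masked patches is the usual masked-autoencoder convention; whether xMAE weights or schedules the loss differently is not stated in the text above.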
If this is right
- Representations transfer to 15 of 19 tasks spanning outcome prediction, anomaly detection, sleep staging, and demographics.
- Performance gains hold when models are tested on new devices, sensor sites, and acquisition protocols.
- Learned PPG embeddings encode measurable ECG-to-PPG timing offsets.
- The approach applies to any multimodal biosignals that observe successive stages of one underlying process.
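One way to make "measurable timing offsets" concrete is a lag estimate between two signals. The sketch below uses plain cross-correlation on synthetic impulse trains; this is a swapped-in technique for illustration, not the paper's procedure (the appendix mentions NeuroKit2 for delay estimation). The 25-sample delay, the 80-sample beat period, and all names are hypothetical; the 100 Hz sampling rate is taken from the paper.

```python
def estimate_lag(reference, delayed, max_lag):
    """Return the lag in samples (0..max_lag) that best aligns `delayed` to `reference`.

    The score for each candidate lag is the dot product between the reference
    and the delayed signal advanced by that lag.
    """
    best_lag, best_score = 0, float("-inf")
    for lag in range(max_lag + 1):
        score = sum(r * d for r, d in zip(reference, delayed[lag:]))
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


def lag_to_ms(lag_samples, sampling_rate_hz):
    """Convert a lag in samples to milliseconds."""
    return 1000.0 * lag_samples / sampling_rate_hz


# Synthetic stand-ins, not real recordings: one impulse per beat every 80
# samples, and a copy delayed by 25 samples to mimic a later peripheral pulse.
fs = 100       # Hz, the sampling rate quoted in the paper
true_lag = 25  # hypothetical delay in samples
ecg_like = [1.0 if i % 80 == 0 else 0.0 for i in range(1000)]
ppg_like = [0.0] * true_lag + ecg_like[:-true_lag]

estimated = estimate_lag(ecg_like, ppg_like, max_lag=60)
abs_error_ms = abs(lag_to_ms(estimated, fs) - lag_to_ms(true_lag, fs))
```

Keeping `max_lag` below the beat period avoids the ambiguity of matching a pulse to the wrong heartbeat. Real signals would need peak detection or filtering before a comparison like this is meaningful.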
Where Pith is reading between the lines
- The same ordering-aware reconstruction could be applied to other causally linked signal pairs such as respiratory effort before oxygen saturation changes.
- Wearable systems might benefit from pretraining on paired ECG-PPG streams to improve real-time fusion without explicit alignment modules.
- Similar constraints may help in other domains where one modality precedes another, such as audio preceding video in speech events.
Load-bearing premise
That the directional timing relationship between signals can be effectively enforced as a reconstruction constraint during pretraining and will produce representations that measurably improve downstream task performance.
What would settle it
A control model pretrained with standard masked reconstruction or contrastive objectives but without any cross-modal ordering constraint achieves equal or higher accuracy on the same 19 downstream tasks.
Original abstract
Biosignals acquired from different locations on the body often provide temporally ordered views of the same underlying physiological process. However, most existing self-supervised learning methods treat these signals as interchangeable views, overlooking the directional temporal dynamics that link them. A canonical example is the relationship between electrocardiography (ECG), which captures the electrical activation initiating each heartbeat, and photoplethysmography (PPG), which records the resulting peripheral pulse delayed by vascular dynamics. To capture this structured relationship, we introduce xMAE, a biosignal pretraining framework that leverages masked cross-modal reconstruction across temporally ordered biosignals as a training-time constraint to encourage physiologically meaningful timing structure in the learned representations. We show that pretraining with xMAE yields representations that outperform both unimodal and multimodal baselines on 15 of 19 downstream tasks, including cardiovascular outcome prediction, abnormal laboratory test detection, sleep staging, and demographic inference, while generalizing across devices, body locations, and acquisition settings. Further analysis suggests that the ECG-PPG timing structure is reflected in the learned PPG representations. More broadly, xMAE demonstrates the effectiveness of incorporating temporal structure into multimodal pretraining when signals observe different stages of a shared underlying process. Code is available at https://github.com/hzhou3/xMAE.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces xMAE, a self-supervised pretraining framework for biosignals that performs masked cross-modal reconstruction between temporally ordered signals (e.g., ECG preceding PPG due to vascular delay) to learn physiologically structured representations. It reports that this approach outperforms unimodal and multimodal baselines on 15 of 19 downstream tasks spanning cardiovascular outcome prediction, abnormal lab test detection, sleep staging, and demographic inference, with generalization across devices, body locations, and settings. Additional analysis indicates that the learned PPG representations reflect the ECG-PPG timing structure.
Significance. If the empirical results are robust and the directional timing mechanism is shown to be causal for the gains, this work would be significant for advancing multimodal self-supervised learning in biosignals by incorporating physiological priors rather than treating signals as interchangeable views. The release of code supports reproducibility. It could influence pretraining strategies for other temporally structured multimodal data in healthcare.
major comments (1)
- Experiments section: No ablation study isolates the effect of the directional temporal ordering (e.g., by randomizing PPG relative to ECG or using symmetric bidirectional reconstruction without delay modeling) while holding masking, architecture, and other factors fixed. This is load-bearing for the central claim, as the reported gains on 15 of 19 tasks (including cardiovascular, sleep, and lab tasks) could arise from generic cross-modal pretraining rather than the physiology-aware timing structure.
minor comments (1)
- Abstract: The claim of outperformance on 15 of 19 tasks is stated without reference to specific baseline definitions, number of runs, or statistical tests, which would strengthen the summary for readers.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address the major comment below and will incorporate revisions to strengthen the evidence for our central claim.
- Referee: Experiments section: No ablation study isolates the effect of the directional temporal ordering (e.g., by randomizing PPG relative to ECG or using symmetric bidirectional reconstruction without delay modeling) while holding masking, architecture, and other factors fixed. This is load-bearing for the central claim, as the reported gains on 15 of 19 tasks (including cardiovascular, sleep, and lab tasks) could arise from generic cross-modal pretraining rather than the physiology-aware timing structure.
  Authors: We agree that the manuscript lacks a dedicated ablation that isolates the directional temporal ordering while holding masking, architecture, and other factors fixed. Our current analysis shows that the learned PPG representations reflect the ECG-PPG timing structure, but this does not fully rule out that gains could arise from generic cross-modal pretraining. We will add the requested ablation (including randomized relative timing and symmetric bidirectional reconstruction) in the revised version to directly test causality of the physiology-aware timing mechanism.
  Revision: yes
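A randomized-timing control of the kind discussed here could be built by perturbing only the alignment between the two signals. The sketch below is one hypothetical way to do that, rotating each PPG segment by a random offset while leaving the ECG untouched; the function names and the toy batch are illustrative, and nothing here is taken from the authors' code.

```python
import random


def circular_shift(signal, offset):
    """Rotate a signal to the right by `offset` samples, wrapping around."""
    offset %= len(signal)
    if offset == 0:
        return list(signal)
    return list(signal[-offset:]) + list(signal[:-offset])


def randomize_relative_timing(ecg_batch, ppg_batch, rng):
    """Pair each ECG with its PPG rotated by an independent random offset.

    ECG segments are left untouched, so only the ECG-to-PPG timing changes.
    """
    pairs = []
    for ecg, ppg in zip(ecg_batch, ppg_batch):
        offset = rng.randrange(len(ppg))
        pairs.append((ecg, circular_shift(ppg, offset)))
    return pairs


# Toy batch of two short "segments" to show the effect.
rng = random.Random(0)
ecg_batch = [[1, 0, 0, 0, 0, 0], [0, 1, 0, 0, 0, 0]]
ppg_batch = [[0, 0, 1, 0, 0, 0], [0, 0, 0, 1, 0, 0]]
control_pairs = randomize_relative_timing(ecg_batch, ppg_batch, rng)
```

Because heartbeats are roughly periodic, a rotation close to a whole number of beat periods leaves the apparent delay nearly intact, so an actual ablation would probably need to control the offset distribution or pair ECG with PPG from a different segment.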
Circularity Check
No circularity: empirical results on downstream tasks are independent of the pretraining objective definition
The paper defines xMAE as a masked cross-modal reconstruction objective that incorporates an external physiological fact (temporal ordering between ECG and PPG due to vascular delay). It then reports measured performance gains on 15 of 19 downstream tasks. This chain does not reduce any claimed result to a fitted parameter renamed as prediction, a self-referential definition, or a load-bearing self-citation. The temporal constraint is imported from physiology rather than derived from the model, and the outperformance numbers are obtained via standard evaluation rather than forced by construction. The derivation is therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Biosignals acquired from different locations on the body often provide temporally ordered views of the same underlying physiological process.
Reference graph
Works this paper leans on
- [1] S. Abbaspourazad, O. Elachqar, A. C. Miller, S. Emrani, U. Nallasamy, and I. Shapiro. Large-scale training of foundation models for wearable biosignals. arXiv preprint arXiv:2312.05409.
- [2] M. A. Ahmad, C. Eckert, and A. Teredesai. Interpretable machine learning in healthcare. In Proceedings of the 2018 ACM international conference on bioinformatics, computational biology, and health informatics, pages 559--560.
- [3] A. F. Ansari, L. Stella, C. Turkmen, X. Zhang, P. Mercado, H. Shen, O. Shchur, S. S. Rangapuram, S. P. Arango, S. Kapoor, et al. Chronos: Learning the language of time series. arXiv preprint arXiv:2403.07815.
- [5] A. Dosovitskiy. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929.
- [12] I. Loshchilov and F. Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101.
- [13] R. Mukkamala, J.-O. Hahn, O. T. Inan, L. K. Mestha, C.-S. Kim, H. Töreyin, and S. Kyal. Toward ubiquitous blood pressure monitoring via pulse transit time: theory and practice. IEEE transactions on biomedical engineering, 62(8):1879--1901.
- [14] G. Narayanswamy, X. Liu, K. Ayush, Y. Yang, X. Xu, S. Liao, J. Garrison, S. Tailor, J. Sunshine, Y. Liu, et al. Scaling wearable foundation models. arXiv preprint arXiv:2410.13638.
- [16] A. Pillai, D. Spathis, F. Kawsar, and M. Malekzadeh. Papagei: Open foundation models for optical physiological signals. arXiv preprint arXiv:2410.20542.
- [18] K. Wang, J. Yang, A. Shetty, and J. Dunn. Dreamt: Dataset for real-time sleep stage estimation using multisensor wearable technology. PhysioNet, https://doi.org/10.13026/62AN-CB28.
- [19] W. Whelton. 2017 guideline for the prevention, detection, evaluation, and management of high blood pressure in adults. J Am Coll Cardiol.
- [21] H. Zhou, M. M. Rahman, M. B. Morshed, Y. Li, M. S. Islam, L. Zhang, J. Bae, C. Rosa, W. B. Mendes, and J. Kuang. Know your heart better: Multimodal cardiac output monitoring using earbuds. In ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1--5. IEEE.
- [22] A. Signal Preprocessing Pipeline. To facilitate pretraining and evaluation, we follow a standard preprocessing pipeline that ensures high-quality PPG and ECG segments. This preprocessing pipeline for PPG and ECG is consistent across all pretraining and evaluation st...
- [23] Input. We consider paired photoplethysmography (PPG) and electrocardiography (ECG) signals collected synchronously from the same subject. Each input sample consists of a 10-second segment sampled at 100 Hz, yielding sequences $P \in \mathbb{R}^{L}$, $E \in \mathbb{R}^{L}$, $L = 1000$ (5). Curriculum ECG Masking Strategy. We adopt a curriculum learning strategy over the ECG masking ratio to prog...
- [24] This yields $Z \in \mathbb{R}^{N' \times d}$, $N' = \lfloor L'/P \rfloor$ (8). For fully observed PPG, this results in $N = 25$ tokens per segment (token length is 40; 40 × 25 = 1000). Learnable positional embeddings are added to encode temporal order. PPG and visible ECG tokens are then processed independently by modality-specific Transformer encoders: $Z'_P = \mathrm{Enc}_P(Z_P)$, $Z'_E = \mathrm{Enc}_E(Z_E)$ (9). The ...
- [25] We use the pretrained PPG encoder as provided, and evaluate its representations on our downstream tasks without additional pretraining or task-specific adaptation. PulsePPG (Open-Source Weights), Saha et al. (2025). For this baseline, we adopt the official PulsePPG implementation and released pretrained weights, and evaluate the model on our downstream ta...
- [26] All training and evaluation are performed on NVIDIA H200 GPUs. C. Evaluation Datasets, Tasks and Protocols. In this section, we introduce datasets, tasks, and protocols that are employed for evaluation. C.1. Evaluation Datasets and Tasks. In total, we have 19 tasks from 6 datasets, including classification and regression. All datasets analyzed in this proje...
- [27] We kept hyperparameters such as the learning rate (1e-5) and batch size (2048) the same across models. Random Seed. We set the random seed to 1 across all tasks and evaluations. D. Justification of Curriculum ECG Masking. We provide a justification for our choice of curriculum ECG masking. Let $\mathcal{L}(M; \theta)$ denote the masked cross-modal reconstruction loss under ECG mask...
- [28] E.4. Evidence 2: xMAE Captures the Time Delay Better than Multimodal Baselines. Figure 14 evaluates how well different models preserve the physiological time delay between ECG and PPG by comparing the absolute error between the ground-truth delay, $\Delta t_{gt}$, computed from real ECG–PPG pairs, and the delay estimated from reconstructed signals. Using NeuroKit2 ...
- [29] Again, these models are trained with different architectures, different sizes, and different pretraining datasets. Yet, xMAE consistently achieves comparable performance on clinically and physiologically grounded tasks, particularly cardiovascular outcomes and laboratory test prediction, where accurate modeling of beat-level timing and pulse dynamics is...