pith. sign in

arxiv: 2606.01537 · v2 · pith:UUOFRAMUnew · submitted 2026-06-01 · 💻 cs.CV · cs.LG

PaCX-MAE: Physiology-Augmented Chest X-Ray Masked Autoencoder

Pith reviewed 2026-06-28 15:51 UTC · model grok-4.3

classification 💻 cs.CV cs.LG
keywords chest X-raymasked autoencodercross-modal learningself-supervised pretrainingphysiological priorsECGmedical imaginglabel efficiency
0
0 comments X

The pith

PaCX-MAE transfers physiological knowledge from ECG and lab data into chest X-ray encoders during pretraining for better unimodal performance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a pretraining approach for chest X-ray models that incorporates information from paired physiological measurements such as ECG signals and laboratory results. This is done through alignment objectives added to standard masked autoencoding, allowing the visual model to learn richer features. The key benefit is improved accuracy on tasks that benefit from physiological context, like certain diagnoses, while the model can still be used with only X-ray images afterward. Tests across multiple datasets show gains over regular MAE methods, with particular strength when labeled data is scarce.

Core claim

PaCX-MAE augments in-domain masked autoencoding with a dual contrastive-predictive objective that aligns chest X-ray representations with embeddings from paired ECG and laboratory data. This cross-modal distillation injects physiological priors into the encoder. Evaluations across nine benchmarks show consistent improvements over domain-specific MAE, especially on physiology-dependent tasks, with high label efficiency and preserved performance on segmentation.

What carries the argument

Dual contrastive-predictive objective for aligning CXR representations with ECG and laboratory embeddings during pretraining.

If this is right

  • Improved results on physiology-dependent tasks such as those measured by AUROC on MedMod and F1 on VinDr.
  • Strong performance in low-label regimes like 1% labeled data.
  • Maintained accuracy on anatomical segmentation tasks comparable to standard MAE.
  • Learned attention to physiological features like the cardiac silhouette.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar alignment strategies could apply to other medical imaging domains with available physiological data.
  • The approach may reduce the need for extensive labeled datasets in medical AI development.
  • Clinical deployment could benefit from models that implicitly capture physiological context from imaging alone.

Load-bearing premise

Paired ECG and laboratory data provide useful physiological priors that can be transferred to chest X-ray interpretation via alignment without causing biases or requiring multimodal data at inference.

What would settle it

An ablation study where removing the physiological alignment leads to no improvement or degradation on physiology-dependent benchmarks, or introduces measurable biases in predictions.

Figures

Figures reproduced from arXiv: 2606.01537 by Kenichi Maeda, Manan Pancholy, Yancheng Liu.

Figure 2
Figure 2. Figure 2: MAE Pretraining Reconstructions. Top: Original CXRs; Bottom: Reconstructions under 90% masking. Despite extreme sparsity, the model accurately recovers key physiological indicators such as the cardiac boundary and diaphragm curvature. Physiological Targets (ECG & Labs). We employ high￾fidelity, frozen encoders to serve as distillation targets. For Laboratory data, we pretrain a mask-aware Denoising Au￾toen… view at source ↗
Figure 1
Figure 1. Figure 1: Overview of the PaCX architecture. The pipeline com￾prises unimodal pretraining (Stage 1) and cross-modal distillation (Stage 2). Colors indicate optimization status: red (trainable), orange (LoRA-adapted), and blue (frozen). During distillation, the CXR encoder learns to predict physiological embeddings via lightweight heads, which are discarded at inference. 3.1. Stage 1: Unimodal Pretraining We utilize … view at source ↗
Figure 3
Figure 3. Figure 3: Label Efficiency. PaCX (red) outperforms MAE (grey) consistently at 1% and 10% training data, demonstrating robust few-shot transfer. Low-Data Efficiency. PaCX significantly lowers sample complexity. As illustrated in [PITH_FULL_IMAGE:figures/full_fig_p003_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Attention Shift. PaCX (middle-right) shifts focus from body structures to the cardiac silhouette (red in Difference Map). 5.3. Component Analysis To disentangle the contributions of specific signals and ob￾jectives, we analyze component-wise performance on three physiology-dense benchmarks ( [PITH_FULL_IMAGE:figures/full_fig_p004_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Additional Attention Rollout Cases. These visualizations confirm the consistent trend of PaCX attending to soft-tissue anatomy versus the edge-focused attention of the MAE baseline. 11 [PITH_FULL_IMAGE:figures/full_fig_p011_5.png] view at source ↗
read the original abstract

Clinical diagnosis often requires combining imaging with physiological measurements, yet deployed models typically operate on unimodal data. We present PaCX-MAE, a cross-modal distillation framework that injects physiological priors into chest X-ray (CXR) encoders while remaining strictly unimodal at inference. PaCX-MAE augments in-domain masked autoencoding with a dual contrastive-predictive objective, aligning CXR representations with paired ECG and laboratory embeddings. Extensive evaluation across nine benchmarks demonstrates consistent improvements over domain-specific MAE, particularly on physiology-dependent tasks (e.g., +2.7 AUROC on MedMod; +6.5 F1 on VinDr). The method proves highly label-efficient in the 1% regime and preserves anatomical fidelity, achieving parity with MAE on segmentation tasks. Zero-shot and attention analyses confirm that PaCX-MAE successfully learns to attend to physiological indicators, such as the cardiac silhouette, absent in standard visual pretraining.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces PaCX-MAE, a cross-modal distillation framework for chest X-ray (CXR) masked autoencoders that augments standard in-domain MAE pretraining with a dual contrastive-predictive objective. This objective aligns CXR representations with embeddings from paired ECG and laboratory data to inject physiological priors, while ensuring the model remains strictly unimodal at inference. The work reports consistent gains over domain-specific MAE across nine benchmarks (e.g., +2.7 AUROC on MedMod, +6.5 F1 on VinDr), strong label efficiency in the 1% regime, parity on segmentation tasks, and improved attention to physiological indicators such as the cardiac silhouette.

Significance. If the reported gains hold after controlling for selection bias in the paired pretraining subset, the approach would demonstrate a practical route to transferring physiological context into unimodal CXR encoders. The label-efficiency results and zero-shot attention analyses would be particularly valuable for medical imaging self-supervised learning, where paired multimodal data are scarce at deployment but available during pretraining.

major comments (3)
  1. [Experiments / Evaluation] The central claim that physiological priors from paired ECG/lab data transfer without introducing biases rests on the assumption that the paired training subset is representative of the broader CXR distribution. The manuscript does not report whether benchmark test sets were matched to the paired subset (by disease severity, demographics, or site) or whether an ablation replacing physiological signals with non-informative auxiliary inputs was performed; without these controls the gains on physiology-dependent tasks cannot be unambiguously attributed to the dual objective.
  2. [Method] §3 (Method): the dual contrastive-predictive alignment is described at a high level, but no equations, loss formulations, or hyperparameter schedules are supplied for the contrastive and predictive terms. This prevents verification that the alignment transfers physiological information rather than simply acting as an additional regularizer.
  3. [Experiments] Table 2 (or equivalent results table): the 1% label-efficiency regime shows large gains, yet no statistical significance tests, multiple random seeds, or confidence intervals are reported. This weakens the claim that the method is "highly label-efficient" relative to standard MAE.
minor comments (2)
  1. [Abstract] The abstract states performance numbers without any methodological details, ablation studies, or baseline descriptions; while the full manuscript presumably supplies these, the abstract should at minimum name the nine benchmarks and the primary baselines.
  2. [Method] Notation for the dual objective (contrastive vs. predictive terms) is introduced without a clear diagram or pseudocode, making the architecture harder to follow than necessary.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. The comments highlight important aspects of experimental rigor and methodological clarity that we address below. We have prepared revisions to strengthen the manuscript accordingly.

read point-by-point responses
  1. Referee: [Experiments / Evaluation] The central claim that physiological priors from paired ECG/lab data transfer without introducing biases rests on the assumption that the paired training subset is representative of the broader CXR distribution. The manuscript does not report whether benchmark test sets were matched to the paired subset (by disease severity, demographics, or site) or whether an ablation replacing physiological signals with non-informative auxiliary inputs was performed; without these controls the gains on physiology-dependent tasks cannot be unambiguously attributed to the dual objective.

    Authors: We agree that explicit controls for selection bias are necessary to support attribution of gains to the physiological signals. In the revised manuscript we will add a supplementary table comparing the paired pretraining subset to each benchmark test set on key covariates (age, sex, disease prevalence, acquisition site). We will also include a new ablation replacing the ECG and laboratory embeddings with random vectors drawn from the same distribution, confirming that performance drops to levels comparable with standard MAE and thereby isolating the contribution of the informative physiological priors. revision: yes

  2. Referee: [Method] §3 (Method): the dual contrastive-predictive alignment is described at a high level, but no equations, loss formulations, or hyperparameter schedules are supplied for the contrastive and predictive terms. This prevents verification that the alignment transfers physiological information rather than simply acting as an additional regularizer.

    Authors: We accept that the absence of explicit formulations limits reproducibility and verification. The revised §3 will contain the complete loss equations: the contrastive term (symmetrized InfoNCE between CXR and ECG/lab embeddings) and the predictive term (regression of laboratory values from the aligned CXR representation). We will also tabulate the weighting coefficients, temperature, and learning-rate schedule used to balance the three objectives (MAE, contrastive, predictive). revision: yes

  3. Referee: [Experiments] Table 2 (or equivalent results table): the 1% label-efficiency regime shows large gains, yet no statistical significance tests, multiple random seeds, or confidence intervals are reported. This weakens the claim that the method is "highly label-efficient" relative to standard MAE.

    Authors: We acknowledge that reporting variability and statistical tests is required to substantiate the label-efficiency claim. The revised Table 2 will present results averaged over five independent random seeds with standard deviations. We will additionally report p-values from paired t-tests between PaCX-MAE and the MAE baseline for each 1% setting, together with 95% confidence intervals. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical method with no derivations or self-referential reductions

full rationale

The paper presents PaCX-MAE as an empirical cross-modal distillation approach evaluated on nine benchmarks. No equations, derivations, or first-principles claims appear in the provided abstract or description. All performance claims (e.g., AUROC/F1 gains) are presented as experimental outcomes rather than predictions derived from fitted parameters or self-citations that reduce to inputs by construction. The method relies on standard contrastive and predictive objectives applied to paired data, with no load-bearing steps that collapse to tautology or self-citation chains. This is the expected outcome for a methods paper focused on architecture and benchmarking.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no explicit free parameters, axioms, or invented entities. The method is described as building on existing masked autoencoding and contrastive techniques.

pith-pipeline@v0.9.1-grok · 5695 in / 1275 out tokens · 34786 ms · 2026-06-28T15:51:54.478281+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

40 extracted references · 8 canonical work pages · 3 internal anchors

  1. [1]

    Azizi and B

    S. Azizi and B. Mustafa and F. Ryan and Z. Beaver and J. Freyberg and J. Deaton and A. Loh and A. Karthikesalingam and S. Kornblith and T. Chen and V. Natarajan and M. Norouzi , title =. arXiv preprint arXiv:2101.05224 , year =

  2. [2]

    Boecking and N

    B. Boecking and N. Usuyama and S. Bannur and D. C. Castro and A. Schwaighofer and S. Hyland and M. Wetscherek and T. Naumann and A. Nori and J. Alvarez-Valle and H. Poon and O. Oktay , title =. Computer Vision -- ECCV 2022 , pages =

  3. [3]

    Dou and Q

    Q. Dou and Q. Liu and P.-A. Heng and B. Glocker , title =. IEEE Transactions on Medical Imaging , year =

  4. [4]

    Wang and others , title =

    H. Wang and others , title =. arXiv preprint arXiv:2310.01035 , year =

  5. [5]

    Jaeger and S

    S. Jaeger and S. Candemir and S. Antani and Y.-X. J. W. Two Public Chest. Quantitative Imaging in Medicine and Surgery , year =

  6. [6]

    Huang and A

    S.-C. Huang and A. Pareek and M. Jensen and M. P. Lungren and S. Yeung and A. S. Chaudhari , title =. npj Digital Medicine , year =

  7. [7]

    Li and A

    J. Li and A. D. Aguirre and V. M. Junior and J. Jin and C. Liu and L. Zhong and C. Sun and G. Clifford and M. B. Westover and S. Hong , title =. NEJM AI , year =

  8. [8]

    Gupta and I

    A. Gupta and I. Osman and M. S. Shehata and J. W. Braun , title =. arXiv preprint arXiv:2407.14784 , year =

  9. [9]

    Tiu and E

    E. Tiu and E. Talius and P. Patel and C. P. Langlotz and A. Y. Ng and P. Rajpurkar , title =. Nature Biomedical Engineering , year =

  10. [10]

    Cross Modal Distillation for Supervision Transfer

    S. Gupta and J. Hoffman and J. Malik , title =. arXiv preprint arXiv:1507.00448 , year =

  11. [11]

    Lopez-Paz and L

    D. Lopez-Paz and L. Bottou and B. Sch. Unifying Distillation and Privileged Information , journal =

  12. [12]

    Cho and K

    K. Cho and K. D. Kim and Y. Nam and J. Jeong and J. Kim and C. Choi and S. Lee and J. S. Lee and S. Woo and G.-S. Hong and J. B. Seo and N. Kim , title =. Journal of Digital Imaging , year =

  13. [13]

    Gorade and A

    V. Gorade and A. Sing and D. Mishra , title =. Proceedings of the 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) , year =

  14. [14]

    Xiao and Y

    J. Xiao and Y. Bai and A. Yuille and Z. Zhou , title =. Proceedings of the 2023 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) , year =

  15. [15]

    Zhou and H

    L. Zhou and H. Liu and J. Bae and J. He and D. Samaras and P. Prasanna , title =. arXiv preprint arXiv:2203.05573 , year =

  16. [16]

    Zhang and H

    Y. Zhang and H. Jiang and Y. Miura and C. D. Manning and C. P. Langlotz , title =. arXiv preprint arXiv:2010.00747 , year =

  17. [17]

    Learning Transferable Visual Models From Natural Language Supervision

    A. Radford and J. W. Kim and C. Hallacy and A. Ramesh and G. Goh and S. Agarwal and G. Sastry and A. Askell and P. Mishkin and J. Clark and G. Krueger and I. Sutskever , title =. arXiv preprint arXiv:2103.00020 , year =

  18. [18]

    Irvin and P

    J. Irvin and P. Rajpurkar and M. Ko and Y. Yu and S. Ciurea-Ilcus and C. Chute and H. Marklund and B. Haghgoo and R. Ball and K. Shpanskaya and J. Seekins and D. A. Mong and S. S. Halabi and J. K. Sandberg and R. Jones and D. B. Larson and C. P. Langlotz and B. N. Patel and M. P. Lungren and A. Y. Ng , title =

  19. [19]

    Saporta and A

    A. Saporta and A. M. Puli and M. Goldstein and R. Ranganath , title =

  20. [20]

    Saporta and A

    A. Saporta and A. Puli and M. Goldstein and R. Ranganath , title =. Advances in Neural Information Processing Systems , year =

  21. [21]

    E. J. Hu and Y. Shen and P. Wallis and Z. Allen-Zhu and Y. Li and S. Wang and L. Wang and W. Chen , title =. arXiv preprint arXiv:2106.09685 , year =

  22. [22]

    Wang and Y

    X. Wang and Y. Peng and L. Lu and Z. Lu and M. Bagheri and R. M. Summers , title =. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) , year =

  23. [23]

    H. Q. Nguyen and K. Lam and L. T. Le and H. H. Pham and D. Q. Tran and D. B. Nguyen and D. D. Le and C. M. Pham and H. T. T. Tong and D. H. Dinh and C. D. Do and L. T. Doan and C. N. Nguyen and B. T. Nguyen and Q. V. Nguyen and A. D. Hoang and H. N. Phan and A. T. Nguyen and P. H. Ho and D. T. Ngo and N. T. Nguyen and N. T. Nguyen and M. Dao and V. Vu , title =

  24. [24]

    H. Q. Nguyen and H. H. Pham and L. T. Linh and M. Dao and L. Khanh , title =

  25. [25]

    Elias and S

    P. Elias and S. Bhave , title =

  26. [26]

    Bhave and V

    S. Bhave and V. Rodriguez and T. Poterucha and S. Mutasa and D. Aberle and K. M. Capaccione and Y. Chen and B. Dsouza and S. Dumeer and J. Goldstein and A. Hodes and J. Leb and M. Lungren and M. Miller and D. Monoky and B. Navot and K. Wattamwar and A. Wattamwar and K. Clerkin and D. Ouyang and E. Ashley and V. K. Topkara and M. Maurer and A. J. Einstein ...

  27. [27]

    Elsharief and S

    S. Elsharief and S. Shurrab and B. Al Jorf and L. J. L. Lopez and K. J. Geras and F. E. Shamout , title =. Proceedings of the Sixth Conference on Health, Inference, and Learning , pages =. 2025 , volume =

  28. [28]

    Indeewara and M

    W. Indeewara and M. Hennayake and K. Rathnayake and T. Ambegoda and D. Meedeniya , title =

  29. [29]

    A. M. Tahir and M. E. H. Chowdhury and Y. Qiblawey and A. Khandakar and T. Rahman and S. Kiranyaz and U. Khurshid and N. Ibtehaz and S. Mahmud and M. Ezeddin , title =

  30. [30]

    Ahishali and A

    M. Ahishali and A. Degerli and M. Yamac and S. Kiranyaz and M. E. H. Chowdhury and K. Hameed and T. Hamid and R. Mazhar and M. Gabbouj , title =. IEEE Access , year =

  31. [31]

    Degerli and M

    A. Degerli and M. Ahishali and S. Kiranyaz and M. E. H. Chowdhury and M. Gabbouj , title =. Proceedings of the 2021 IEEE International Conference on Image Processing (ICIP) , year =

  32. [32]

    Degerli and S

    A. Degerli and S. Kiranyaz and M. E. H. Chowdhury and M. Gabbouj , title =. Proceedings of the 2022 IEEE International Conference on Image Processing (ICIP) , year =

  33. [33]

    Degerli and M

    A. Degerli and M. Ahishali and M. Yamac and S. Kiranyaz and M. E. H. Chowdhury and K. Hameed and T. Hamid and R. Mazhar and M. Gabbouj , title =. Health Information Science and Systems , year =

  34. [34]

    M. E. H. Chowdhury and T. Rahman and A. Khandakar and R. Mazhar and M. A. Kadir and Z. B. Mahbub and K. R. Islam and M. S. Khan and A. Iqbal and N. A. Emadi and M. B. I. Reaz and M. T. Islam , title =. IEEE Access , year =

  35. [35]

    M. Yama. Convolutional Sparse Support Estimator-Based. IEEE Transactions on Neural Networks and Learning Systems , year =

  36. [36]

    Rahman and A

    T. Rahman and A. Khandakar and Y. Qiblawey and A. Tahir and S. Kiranyaz and S. B. A. Kashem and M. T. Islam and S. Al Maadeed and S. M. Zughaier and M. S. Khan and M. E. H. Chowdhury , title =. Computers in Biology and Medicine , year =

  37. [37]

    Candemir and S

    S. Candemir and S. Jaeger and K. Palaniappan and J. P. Musco and R. K. Singh and Z. Xue and A. Karargyris and S. Antani and G. Thoma and C. J. McDonald , title =. IEEE Transactions on Medical Imaging , year =

  38. [38]

    Jaeger and A

    S. Jaeger and A. Karargyris and S. Candemir and J. Siegelman and L. Folio and S. Antani and G. Thoma and C. J. McDonald , title =. Quantitative Imaging in Medicine and Surgery , year =

  39. [39]

    He and X

    K. He and X. Chen and S. Xie and Y. Li and P. Doll. Masked Autoencoders Are Scalable Vision Learners , journal =

  40. [40]

    2024 , eprint=

    Contrasting with Symile: Simple Model-Agnostic Representation Learning for Unlimited Modalities , author=. 2024 , eprint=