pith. machine review for the scientific record.

arxiv: 2604.08334 · v1 · submitted 2026-04-09 · 📊 stat.CO · stat.AP

Recognition: unknown

mmid: Multi-Modal Integration and Downstream analyses for healthcare analytics in Python

Pith reviewed 2026-05-10 17:08 UTC · model grok-4.3

classification 📊 stat.CO stat.AP
keywords multi-modal integration · Python package · healthcare analytics · cardiovascular disease · data imputation · UK Biobank · fusion algorithms · downstream prediction

The pith

A Python package fuses imaging, electrical, and genetic heart data to identify cardiovascular disease earlier and more accurately than single sources.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The mmid package provides a unified way to combine different kinds of health data for analysis tasks like prediction and clustering. In testing on heart imaging, heart rhythm recordings, and genetic risk scores, the combined data allowed spotting cardiovascular problems before they showed up clinically and did so better than using just one kind of data. The package also imputes missing data types effectively, keeping prediction performance stable even when some information is absent, which is common in actual health studies.

Core claim

mmid is a Python package offering multi-modal fusion and imputation along with classification, time-to-event prediction, and clustering under one interface. In the showcase with cardiac magnetic resonance imaging, electrocardiogram, and polygenic risk scores from the UK Biobank, the modalities provided joint and individual information that supported early cardiovascular disease identification before clinical manifestations and with greater effectiveness than any single modality. The package further enabled imputation of partially observed modalities while maintaining downstream prediction performance.
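As a rough illustration of what "joint information" across modalities can mean, the sketch below simulates three feature tables driven by one shared latent factor and recovers that factor by standardizing each table, concatenating them, and taking the leading singular vector. This is plain early fusion with a PCA-style decomposition, chosen for brevity; it is not AJIVE or MOFA+, and none of the names come from mmid. All data are synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 300
shared = rng.normal(size=n)  # latent factor common to all modalities (synthetic)

def make_block(n_features, noise):
    """Simulate one modality: shared signal times random loadings, plus noise."""
    loadings = rng.normal(size=n_features)
    return np.outer(shared, loadings) + noise * rng.normal(size=(n, n_features))

# Stand-ins for imaging-, ECG- and PRS-derived feature tables.
blocks = [make_block(10, 0.5), make_block(6, 0.5), make_block(4, 0.5)]

def standardize(X):
    """Center each column and scale it to unit variance."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

# Early fusion: standardize each block, concatenate, take the top component.
fused = np.hstack([standardize(b) for b in blocks])
U, S, Vt = np.linalg.svd(fused, full_matrices=False)
joint_score = U[:, 0] * S[0]

corr = abs(np.corrcoef(joint_score, shared)[0, 1])
print(f"|correlation| between recovered and true shared factor: {corr:.2f}")
```

A decomposition that also separates modality-specific ("individual") structure would need more than a single concatenated SVD; the sketch only shows the shared part.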

What carries the argument

The mmid Python package, which integrates multiple algorithms for multi-modal data decomposition, imputation, prediction, and clustering through a single command interface and configuration files.
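The idea of driving a whole analysis from a configuration file can be sketched as a small registry-and-dispatch pattern. Everything here is hypothetical: the config keys `fusion` and `task`, the registry names, and the toy functions are illustrative assumptions and do not reflect mmid's actual configuration schema or API.

```python
import json

# Hypothetical building blocks; the names are illustrative, not mmid's API.
def concat_fusion(modalities):
    """Fuse by concatenating each subject's features across modalities."""
    n_subjects = len(next(iter(modalities.values())))
    return [
        [v for name in sorted(modalities) for v in modalities[name][i]]
        for i in range(n_subjects)
    ]

def mean_score(features):
    """Toy downstream step: one score per subject (mean of fused features)."""
    return [sum(row) / len(row) for row in features]

FUSION = {"concat": concat_fusion}
TASKS = {"mean_score": mean_score}

def run_pipeline(config_text, modalities):
    """Parse a JSON config, then run fusion followed by the downstream task."""
    config = json.loads(config_text)
    fuse = FUSION[config["fusion"]]
    task = TASKS[config["task"]]
    return task(fuse(modalities))

config_text = '{"fusion": "concat", "task": "mean_score"}'
modalities = {
    "cmr": [[1.0, 3.0], [2.0, 4.0]],
    "ecg": [[5.0], [6.0]],
}
scores = run_pipeline(config_text, modalities)
print(scores)  # [3.0, 4.0]
```

Keeping the method choice in the config rather than in code is what lets the same command reproduce an analysis: rerunning with the same file selects the same steps.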

If this is right

  • The combined modalities capture both shared patterns and unique details that aid in early cardiovascular disease detection.
  • The multi-modal approach yields better disease prediction results than using cardiac MRI, ECG, or polygenic risk scores separately.
  • Imputation of missing data modalities incurs no substantial reduction in prediction accuracy for cardiovascular outcomes.
  • The package structure promotes reproducibility by allowing analyses to be run via configuration files.
  • Downstream tasks such as time-to-event analysis and clustering become straightforward within the same multi-modal framework.
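The imputation point above can be made concrete with a minimal sketch, assuming a roughly linear relationship between a fully observed modality and a partially observed one. Least-squares regression is used here as a simple stand-in; the imputation routines that mmid wraps are not specified in this review, and the data are synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p_a, p_b = 200, 5, 3
A = rng.normal(size=(n, p_a))                 # fully observed modality
W = rng.normal(size=(p_a, p_b))
B = A @ W + 0.1 * rng.normal(size=(n, p_b))   # partially observed modality

missing = np.zeros(n, dtype=bool)
missing[:50] = True                           # first 50 subjects lack modality B

# Fit a linear map A -> B on complete subjects (with an intercept column).
A1 = np.column_stack([np.ones(n), A])
coef, *_ = np.linalg.lstsq(A1[~missing], B[~missing], rcond=None)

B_imputed = B.copy()
B_imputed[missing] = A1[missing] @ coef

# Compare against a naive column-mean fill on the held-back rows.
B_mean = np.tile(B[~missing].mean(axis=0), (missing.sum(), 1))
rmse_reg = np.sqrt(np.mean((B_imputed[missing] - B[missing]) ** 2))
rmse_mean = np.sqrt(np.mean((B_mean - B[missing]) ** 2))
print(f"regression RMSE={rmse_reg:.3f}, mean-fill RMSE={rmse_mean:.3f}")
```

In a real cohort the true values of the missing modality are unknown, so a check like this is only possible by masking observed rows, as done here.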

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The imputation capability could extend the usable sample size in studies where not all participants have complete multi-modal measurements.
  • Similar integration methods might apply to predicting other conditions with available imaging, signal, and genetic data.
  • Testing the package on datasets from different populations would check if the performance benefits generalize.

Load-bearing premise

The specific algorithms for fusion, imputation, and prediction in the package correctly identify and combine the meaningful information from the three data types without being overly tuned to this one dataset.

What would settle it

A test on new data from a different group of people where the multi-modal predictions show no advantage over the strongest individual data source or where imputed values lead to clearly worse predictions.
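One way to run such a test is a paired bootstrap on the AUC difference between the multi-modal model and the strongest single-modality model, evaluated on the same external subjects. The sketch below uses synthetic risk scores as placeholders for the two models' outputs; the effect sizes, cohort size, and number of resamples are arbitrary assumptions.

```python
import numpy as np

def auc(y, s):
    """AUC as P(case score > control score), counting ties as 0.5."""
    pos, neg = s[y == 1], s[y == 0]
    diff = pos[:, None] - neg[None, :]
    return (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / (len(pos) * len(neg))

rng = np.random.default_rng(42)
n = 400
y = rng.integers(0, 2, size=n)
# Synthetic risk scores on an external cohort (assumed, for illustration only).
score_multi = 1.5 * y + rng.normal(size=n)
score_single = 0.5 * y + rng.normal(size=n)

delta = auc(y, score_multi) - auc(y, score_single)

# Paired bootstrap: resample subjects, recompute both AUCs on the same draw.
boot = []
for _ in range(500):
    idx = rng.integers(0, n, size=n)
    yb = y[idx]
    if yb.min() == yb.max():  # skip degenerate resamples with a single class
        continue
    boot.append(auc(yb, score_multi[idx]) - auc(yb, score_single[idx]))
lo, hi = np.percentile(boot, [2.5, 97.5])
print(f"AUC gain = {delta:.3f}, 95% bootstrap CI [{lo:.3f}, {hi:.3f}]")
```

An interval that includes zero on the new cohort would be the "no advantage" outcome described above; an interval clearly above zero would support the multi-modal claim.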

Figures

Figures reproduced from arXiv: 2604.08334 by Andrea Mario Vergani, Emanuele Di Angelantonio, Francesca Ieva, Marco Masseroli, Valeria Iapaolo.

Figure 1
Figure 1: mmid outline. mmid flow, with sequential multi-modal integration and downstream analysis, and their inputs and outputs.
Adjacent paper text captured with the caption (truncated in the source): Multi-modal fusion: the Integrator class. The Multi-modal fusion module of mmid - embodied by the Integrator class in Python - acts as a wrapper for unsupervised multi-modal fusion approaches that project whichever number of modality-specific tabular datasets into a single merged represent…
Figure 2
Figure 2: Variability explained by the AJIVE merged representation. Variability explained by the joint and individual components (plus residual) of the Angle-based Joint and Individual Variation Explained merged representation, across the cardiac magnetic resonance (CMR) imaging, polygenic risk score (PRS) and electrocardiogram (ECG) datasets. This plot was created by the mmid Python package.
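A joint/individual/residual breakdown like the one in Figure 2 can be computed as shares of squared Frobenius norm when a modality's data matrix splits into three mutually orthogonal parts. The sketch below builds such a split by construction on synthetic data; how mmid computes the plotted values is not stated on this page, so this is an assumption about the general technique rather than a description of the package.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 100, 8

# Orthonormal subject-space directions, so the three parts are mutually orthogonal.
Q, _ = np.linalg.qr(rng.normal(size=(n, 3)))
joint = 3.0 * Q[:, [0]] @ rng.normal(size=(1, p))   # shared across modalities
indiv = 2.0 * Q[:, [1]] @ rng.normal(size=(1, p))   # specific to this modality
resid = 1.0 * Q[:, [2]] @ rng.normal(size=(1, p))   # leftover variation
X = joint + indiv + resid

def share(part, total):
    """Fraction of total variation carried by one part (squared Frobenius norms)."""
    return np.linalg.norm(part, "fro") ** 2 / np.linalg.norm(total, "fro") ** 2

shares = {
    "joint": share(joint, X),
    "individual": share(indiv, X),
    "residual": share(resid, X),
}
for name, value in shares.items():
    print(f"{name}: {100 * value:.1f}%")
```

The shares add up to 100% only because the parts are orthogonal here; with a fitted decomposition on real data the residual need not be orthogonal to the other two, and the shares may not sum exactly.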
Figure 3
Figure 3: Variability explained by the MOFA+ merged representation. Variability explained by the Multi-Omics Factor Analysis merged factors, across the cardiac magnetic resonance (CMR) imaging and electrocardiogram (ECG) datasets. The blue colour scale represents explained variance (%). This plot was created by the mmid Python package.
Table header captured with the caption (truncated in the source): Disease subtype [clf no imputation] AUC CV AUC test Cohort size [clf imputation] …
Original abstract

mmid (Multi-Modal Integration and Downstream analyses for healthcare analytics) is a Python package that offers multi-modal fusion and imputation, classification, time-to-event prediction and clustering functionalities under a single interface, filling the gap of sequential data integration and downstream analyses for healthcare applications in a structured and flexible environment. mmid wraps in a unique package several algorithms for multi-modal decomposition, prediction and clustering, which can be combined smoothly with a single command and proper configuration files, thus facilitating reproducibility and transferability of studies involving heterogeneous health data sources. A showcase on personalised cardiovascular risk prediction is used to highlight the relevance of a composite pipeline enabling proper treatment and analysis of complex multi-modal data. We thus employed mmid in an example real application scenario involving cardiac magnetic resonance imaging, electrocardiogram, and polygenic risk scores data from the UK Biobank. We proved that the three modalities captured joint and individual information that was used to (1) early identify cardiovascular disease before clinical manifestations with cardiological relevance, and (2) do it better than single data sources alone. Moreover, mmid allowed to impute partially observable data modalities without considerable performance losses in downstream disease prediction, thus proving its relevance for real-world health analytics applications (which are often characterised by the presence of missing data).

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces mmid, a Python package providing a unified interface for multi-modal fusion, imputation, classification, time-to-event prediction, and clustering in healthcare analytics. It wraps existing algorithms for these tasks and uses configuration files to promote reproducibility. The central demonstration is a UK Biobank showcase combining cardiac MRI, ECG, and polygenic risk scores, asserting that the modalities extract joint and individual information enabling early cardiovascular disease identification before clinical manifestations, with superior performance to single modalities, and that imputation of missing modalities incurs no considerable loss in downstream prediction accuracy.

Significance. If the empirical claims are substantiated, the package would address a practical gap by offering an integrated, reproducible workflow for heterogeneous health data, particularly useful for handling missing modalities in real-world cohorts. The showcase suggests utility for pre-symptomatic CVD risk stratification using imaging, electrophysiological, and genetic sources, which could aid transferability of multi-modal studies.

major comments (2)
  1. [Abstract] Abstract: The assertions that the three modalities 'captured joint and individual information' to 'early identify cardiovascular disease before clinical manifestations with cardiological relevance' and 'do it better than single data sources alone' are presented without any quantitative metrics (e.g., AUC, hazard ratios, p-values), baseline comparisons, error bars, or details on the specific fusion/imputation algorithms and validation procedures. This is load-bearing for the central empirical claim of superiority and effective imputation.
  2. [Showcase section] Showcase/results description: The manuscript frames the UK Biobank application as proving the package's relevance but supplies no tables or figures with performance numbers for multi-modal vs. single-modality models, no description of how joint/individual components were extracted or validated, and no assessment of potential dataset-specific biases or overfitting in the chosen cohort.
minor comments (2)
  1. [Abstract] The abstract uses 'we proved' for empirical results; consider rephrasing to 'we demonstrate' or 'we show' to reflect the illustrative nature of the showcase.
  2. [Methods] Ensure the full manuscript includes a dedicated methods subsection detailing the wrapped algorithms, configuration options, and exact pipeline steps used in the showcase for reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript introducing the mmid package. The comments correctly identify areas where the empirical claims require stronger quantitative support and clearer presentation of methods and results. We address each major comment below and commit to revisions that will substantiate the key findings without altering the core contributions of the work.

Point-by-point responses
  1. Referee: [Abstract] Abstract: The assertions that the three modalities 'captured joint and individual information' to 'early identify cardiovascular disease before clinical manifestations with cardiological relevance' and 'do it better than single data sources alone' are presented without any quantitative metrics (e.g., AUC, hazard ratios, p-values), baseline comparisons, error bars, or details on the specific fusion/imputation algorithms and validation procedures. This is load-bearing for the central empirical claim of superiority and effective imputation.

    Authors: We agree that the abstract would be strengthened by including key quantitative results. In the revised manuscript, we will add concise statements of the main performance metrics from the UK Biobank showcase, including AUC values and improvements for the multi-modal model over single-modality baselines, hazard ratios for the time-to-event analysis, and associated p-values. We will also specify the primary fusion and imputation algorithms employed (e.g., the multi-modal decomposition and imputation routines wrapped by mmid) and note the cross-validation procedure used. These additions will be kept brief to respect abstract length constraints while directly addressing the load-bearing claims. revision: yes

  2. Referee: [Showcase section] Showcase/results description: The manuscript frames the UK Biobank application as proving the package's relevance but supplies no tables or figures with performance numbers for multi-modal vs. single-modality models, no description of how joint/individual components were extracted or validated, and no assessment of potential dataset-specific biases or overfitting in the chosen cohort.

    Authors: The current showcase section is intentionally high-level to demonstrate package usage rather than serve as a full results paper. We acknowledge that this leaves the empirical claims under-supported in the text. We will expand the section to include a summary table of performance metrics comparing multi-modal fusion against single-modality baselines (AUC, concordance index, etc.), with error bars or confidence intervals where applicable. We will describe the extraction of joint and individual components via the specific mmid decomposition functions and their validation through hold-out testing. Finally, we will add a brief discussion of UK Biobank cohort characteristics, potential selection biases, and mitigation steps such as stratified cross-validation to address overfitting concerns. These changes will be integrated into the results and discussion sections. revision: yes
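The stratified cross-validation and spread reporting promised in the rebuttal could look roughly like the sketch below. It is written with NumPy only to stay dependency-light (scikit-learn's `StratifiedKFold` provides equivalent splitting); the synthetic cohort, the least-squares scorer, and the fold count are assumptions for illustration and say nothing about the authors' actual setup.

```python
import numpy as np

def auc(y, s):
    """AUC as P(case score > control score), counting ties as 0.5."""
    pos, neg = s[y == 1], s[y == 0]
    diff = pos[:, None] - neg[None, :]
    return (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / (len(pos) * len(neg))

def stratified_folds(y, k, rng):
    """Assign each sample a fold id in 0..k-1, preserving class proportions."""
    folds = np.empty(len(y), dtype=int)
    for label in np.unique(y):
        idx = rng.permutation(np.flatnonzero(y == label))
        folds[idx] = np.arange(len(idx)) % k
    return folds

rng = np.random.default_rng(7)
n, p, k = 500, 6, 5
X = rng.normal(size=(n, p))
# Imbalanced synthetic outcome (roughly 20% cases) driven by two features.
logit = -1.7 + 1.2 * X[:, 0] - 0.8 * X[:, 1]
y = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

folds = stratified_folds(y, k, rng)
X1 = np.column_stack([np.ones(n), X])
aucs = []
for f in range(k):
    train, test = folds != f, folds == f
    # Least-squares linear scorer as a simple stand-in classifier.
    beta, *_ = np.linalg.lstsq(X1[train], y[train].astype(float), rcond=None)
    aucs.append(auc(y[test], X1[test] @ beta))

print(f"AUC = {np.mean(aucs):.3f} +/- {np.std(aucs, ddof=1):.3f} over {k} folds")
```

Stratifying matters most when cases are rare: without it, a fold can end up with very few events and an unstable AUC.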

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper is a software package description (mmid) that wraps existing multi-modal decomposition, imputation, prediction and clustering algorithms, then demonstrates them empirically on external UK Biobank cardiac MRI, ECG and polygenic risk score data. No mathematical derivations, equations, fitted parameters or theoretical claims are presented whose outputs reduce by construction to the inputs. All performance results are obtained from held-out or external data splits rather than self-defined quantities, and no load-bearing self-citations or ansatzes are invoked to justify core results.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are introduced; the work relies on standard assumptions of existing multi-modal ML methods and the representativeness of UK Biobank data.

pith-pipeline@v0.9.0 · 5541 in / 1238 out tokens · 36738 ms · 2026-05-10T17:08:43.203571+00:00 · methodology
