dashi: A Python library for Dataset Shift Characterization to Support Trustworthy AI Development and Deployment
Pith reviewed 2026-06-28 22:56 UTC · model grok-4.3
The pith
The dashi Python library quantifies dataset shifts with unsupervised information geometry metrics and supervised performance checks to support trustworthy AI.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
dashi is a Python library providing a dual approach to dataset shift analysis: an unsupervised method that uses information geometry and non-parametric statistical manifolds to characterize data variability through metrics such as Global Probabilistic Deviation and Source Probabilistic Outlyingness, together with Information Geometric Temporal plots, and a supervised method that quantifies model performance degradation, with both methods applicable across user-defined temporal and domain or source batches.
What carries the argument
The dual unsupervised-supervised framework that applies information geometry and non-parametric statistical manifolds for variability metrics alongside performance degradation analysis on temporal and multi-source batches.
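To make the unsupervised idea concrete, the sketch below computes pairwise distances between the probability distributions of user-defined batches. It assumes the Jensen-Shannon distance as the dissimilarity measure. That choice, the function names, and the toy data are illustrative assumptions, not dashi's API, and the sketch does not reproduce the Global Probabilistic Deviation or Source Probabilistic Outlyingness metrics themselves.

```python
import math
from collections import Counter

def histogram(values, categories):
    """Relative frequency of each category within one batch."""
    counts = Counter(values)
    total = len(values)
    return [counts[c] / total for c in categories]

def js_distance(p, q):
    """Jensen-Shannon distance with base-2 logs, bounded in [0, 1]."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with zero mass in x contribute nothing to the divergence.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    jsd = (kl(p, m) + kl(q, m)) / 2
    return math.sqrt(max(jsd, 0.0))

# Hypothetical categorical variable observed in three monthly batches.
batches = {
    "2020-01": ["a", "a", "b", "c"],
    "2020-02": ["a", "a", "b", "c"],
    "2020-03": ["c", "c", "c", "b"],
}
categories = ["a", "b", "c"]
hists = {key: histogram(vals, categories) for key, vals in batches.items()}
keys = list(hists)

# Symmetric matrix of pairwise distances between batch distributions.
dist = [[js_distance(hists[i], hists[j]) for j in keys] for i in keys]
```

A matrix like `dist` is the kind of input that a low-dimensional projection of batches could be built from. The same code applies to source batches (sites) if the dictionary keys are site identifiers instead of months.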
If this is right
- Shifts can be quantified and visualized across both temporal and multi-source batches using the supplied metrics.
- Model performance changes due to shifts can be tracked through the supervised component.
- Interactive analytics enable assessment of data coherence to guide AI pipeline decisions.
- The tools apply to training and operational stages to help maintain reliability in health AI systems.
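The supervised side of these points can be illustrated with a minimal sketch that groups predictions by batch and reports the accuracy drop relative to a reference batch. The record layout, the function names, and the choice of accuracy as the metric are assumptions made for illustration; they are not taken from dashi.

```python
from collections import defaultdict

def accuracy_by_batch(records):
    """records: iterable of (batch_key, y_true, y_pred) tuples."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for batch, y_true, y_pred in records:
        totals[batch] += 1
        hits[batch] += int(y_true == y_pred)
    return {batch: hits[batch] / totals[batch] for batch in sorted(totals)}

def degradation(acc_by_batch, reference_batch):
    """Accuracy lost in each batch relative to the reference batch."""
    ref = acc_by_batch[reference_batch]
    return {batch: ref - acc for batch, acc in acc_by_batch.items()}

# Hypothetical predictions of a fixed model, evaluated on three monthly batches.
records = [
    ("2021-01", 1, 1), ("2021-01", 0, 0), ("2021-01", 1, 1), ("2021-01", 0, 0),
    ("2021-02", 1, 1), ("2021-02", 0, 0), ("2021-02", 1, 0), ("2021-02", 0, 0),
    ("2021-03", 1, 0), ("2021-03", 0, 1), ("2021-03", 1, 1), ("2021-03", 0, 0),
]
acc = accuracy_by_batch(records)
drop = degradation(acc, "2021-01")
```

In practice the batch key could just as well be a site or source identifier, which gives a multi-source view of the same degradation measure.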
Where Pith is reading between the lines
- The same metrics could support ongoing monitoring once a model is deployed rather than only during development.
- Integration with existing training workflows might allow automatic retraining triggers when certain shift thresholds are crossed.
- The library's structure could be tested on non-health domains such as financial or sensor data where distribution changes are also common.
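The retraining-trigger idea can be sketched as a simple rule over per-batch shift scores. Everything here is hypothetical: the threshold, the `patience` parameter, and the scores are invented for illustration, and the paper does not describe such a trigger.

```python
def first_trigger(distances_to_reference, threshold, patience=2):
    """Return the first batch at which the shift score has exceeded
    `threshold` for `patience` consecutive batches, or None."""
    run = 0
    for batch in sorted(distances_to_reference):
        if distances_to_reference[batch] > threshold:
            run += 1
            if run >= patience:
                return batch
        else:
            run = 0  # An in-range batch resets the streak.
    return None

# Hypothetical distance of each monthly batch to the training distribution.
distances = {
    "2022-01": 0.05,
    "2022-02": 0.31,
    "2022-03": 0.08,
    "2022-04": 0.35,
    "2022-05": 0.40,
}
trigger = first_trigger(distances, threshold=0.3, patience=2)
```

Requiring consecutive exceedances is one way to avoid reacting to a single noisy batch; a suitable threshold would have to be calibrated per dataset.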
Load-bearing premise
That the unsupervised metrics derived from information geometry and non-parametric manifolds deliver actionable characterization of shifts that meaningfully supports AI trustworthiness and safety.
What would settle it
A controlled test on health data in which shifts detected and measured by dashi show no consistent correlation with actual drops in model accuracy or increases in safety risks.
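One way to run such a test is to correlate a per-batch shift score with the per-batch accuracy drop. The sketch below uses the Pearson coefficient on made-up numbers; the variable names and values are assumptions, and a rank correlation may be a better choice when the relationship is monotonic but not linear.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / (sx * sy)

# Hypothetical per-batch values: unsupervised shift score and accuracy drop.
shift_score = [0.02, 0.10, 0.25, 0.40]
accuracy_drop = [0.00, 0.03, 0.09, 0.15]
r = pearson(shift_score, accuracy_drop)
```

A coefficient near zero across many batches and datasets would support the falsifying outcome described above, while a consistently high value would count against it.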
Original abstract
The Artificial Intelligence (AI) life cycle requires a thorough understanding of the underlying data dynamics for robust, safe and cost-effective AI development and use. Dataset shifts are defined as changes between train and test data distributions. Whether occurring over time (temporal) or across different sites (multi-source), they can severely degrade model performance and compromise data quality. This is particularly important in health AI, where the safety and fundamental rights of patients can be severely affected by uncontrolled shifts both at training and operational stages. While the theoretical foundations of covariate, prior, and concept shifts are well established, there is a lack of accessible and comprehensive software tools to perform their analysis. We introduce dashi, an open-source Python library designed for the exploration, quantification, and characterization of dataset shifts. dashi provides a dual approach: an unsupervised approach that leverages information geometry and non-parametric statistical manifolds for data variability characterization and analysis (e.g., Information Geometric Temporal plots and Multi-Source Variability metrics like Global Probabilistic Deviation and Source Probabilistic Outlyingness), and a supervised approach that quantifies and characterizes model performance degradation. Both unsupervised and supervised approaches work across user-defined temporal and domain/source batches. We demonstrate the utility of dashi on three simulated and real-world health AI case studies on gestational diabetes mellitus, COVID-19 and emergency medical dispatch. By providing interactive visual analytics and variability metrics, dashi supports the trustworthiness of AI life cycle stages, enabling robust and safe machine learning pipelines through the assessment of data coherence and AI performance.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces dashi, an open-source Python library for the exploration, quantification, and characterization of dataset shifts. It offers a dual approach: an unsupervised method based on information geometry and non-parametric statistical manifolds (including Information Geometric Temporal plots and metrics such as Global Probabilistic Deviation and Source Probabilistic Outlyingness) plus a supervised method for quantifying model performance degradation. Both operate over user-defined temporal and domain/source batches. Utility is demonstrated via three simulated and real-world health AI case studies (gestational diabetes mellitus, COVID-19, and emergency medical dispatch). The central claim is that the library's interactive visual analytics and variability metrics support trustworthy AI development and deployment by assessing data coherence.
Significance. If the implemented metrics and visualizations prove reliable and actionable, an open-source library providing both unsupervised geometric and supervised performance-based shift tools would address a genuine gap in accessible software for dataset shift analysis, particularly in high-stakes health AI applications where shifts can affect safety and rights. The dual unsupervised/supervised design and batch flexibility are explicit strengths that could facilitate reproducible pipelines.
Major comments (2)
- [Case Studies] Case Studies section: the three demonstrations are described at a high level but the manuscript provides no quantitative validation metrics, error analysis, or comparison against existing shift-detection baselines for the unsupervised metrics (Global Probabilistic Deviation, Source Probabilistic Outlyingness). This is load-bearing for the claim that the tools meaningfully support AI trustworthiness and safety.
- [Methods] Methods / Unsupervised Approach: the information-geometric and non-parametric manifold constructions are referenced but lack explicit algorithmic pseudocode, parameter settings, or sensitivity analysis, preventing independent assessment of whether the metrics are robust or merely descriptive.
Minor comments (2)
- The manuscript would benefit from a dedicated 'Availability and Installation' subsection that includes the exact GitHub or PyPI link, license, and minimum Python/dependency versions.
- Notation for the variability metrics is introduced in prose; adding a short mathematical definitions table or appendix would improve clarity for readers unfamiliar with information geometry.
Simulated Author's Rebuttal
We thank the referee for their constructive comments, which highlight important areas for strengthening the manuscript. We agree that both major points require attention and will revise the paper to address them directly.
Referee: [Case Studies] Case Studies section: the three demonstrations are described at a high level but the manuscript provides no quantitative validation metrics, error analysis, or comparison against existing shift-detection baselines for the unsupervised metrics (Global Probabilistic Deviation, Source Probabilistic Outlyingness). This is load-bearing for the claim that the tools meaningfully support AI trustworthiness and safety.
Authors: We agree that the case studies, as currently presented, are primarily illustrative and do not include the requested quantitative validation, error analysis, or baseline comparisons. This limits the strength of the claims regarding support for AI trustworthiness. In the revised manuscript we will add quantitative evaluations of the unsupervised metrics (e.g., correlation with known distribution changes, comparison of Global Probabilistic Deviation and Source Probabilistic Outlyingness against baselines such as Kolmogorov-Smirnov tests and other shift detectors), along with error analysis and discussion of how these metrics relate to downstream model performance degradation.
Revision: yes
Referee: [Methods] Methods / Unsupervised Approach: the information-geometric and non-parametric manifold constructions are referenced but lack explicit algorithmic pseudocode, parameter settings, or sensitivity analysis, preventing independent assessment of whether the metrics are robust or merely descriptive.
Authors: We concur that the absence of pseudocode, explicit parameter settings, and sensitivity analysis hinders reproducibility and independent evaluation. The revised Methods section will include algorithmic pseudocode for the core unsupervised procedures (information-geometric manifold construction, Global Probabilistic Deviation, and Source Probabilistic Outlyingness), the specific parameter values used in the library implementation, and a sensitivity analysis examining robustness to key hyperparameters and data characteristics.
Revision: yes
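The Kolmogorov-Smirnov baseline named in the first response can be sketched in a few lines. The code below computes only the two-sample KS statistic (the largest gap between the two empirical CDFs) for one numeric variable; it does not compute a p-value, and the sample values are invented. Where SciPy is available, `scipy.stats.ks_2samp` provides both the statistic and a p-value.

```python
import bisect

def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: the maximum absolute
    difference between the empirical CDFs of the two samples."""
    a, b = sorted(sample_a), sorted(sample_b)
    d = 0.0
    # The empirical CDFs are step functions, so the largest gap
    # is always attained at one of the observed values.
    for x in a + b:
        cdf_a = bisect.bisect_right(a, x) / len(a)
        cdf_b = bisect.bisect_right(b, x) / len(b)
        d = max(d, abs(cdf_a - cdf_b))
    return d

# Hypothetical numeric variable in a reference batch and a later batch.
reference = [5.1, 5.4, 5.0, 5.3, 5.2, 5.5]
later = [5.9, 6.1, 5.4, 6.0, 6.2, 5.8]
d_stat = ks_statistic(reference, later)
```

A univariate test like this has to be run once per variable, which is one reason a comparison against distribution-level metrics would be informative.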
Circularity Check
No significant circularity
The paper presents a software library for dataset shift analysis with no mathematical derivations, equations, predictions, or fitted parameters. All claims concern implementation of existing concepts (information geometry, variability metrics) and case-study demonstrations; no step reduces by construction to its own inputs, and no self-citation chain is load-bearing for a theoretical result.