pith. machine review for the scientific record.

arxiv: 2604.15740 · v1 · submitted 2026-04-17 · 💻 cs.CY

Recognition: unknown

Evidence Sufficiency Under Delayed Ground Truth: Proxy Monitoring for Risk Decision Systems

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 08:14 UTC · model grok-4.3

classification 💻 cs.CY
keywords evidence sufficiency · delayed ground truth · proxy monitoring · drift detection · risk decision systems · governance · machine learning · fraud detection

The pith

Delayed outcome labels degrade evidence quality in four measurable dimensions that proxy indicators can monitor without waiting for results.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Machine learning risk systems in fraud, credit, and clinical domains must decide before outcome labels arrive, leaving a blind period where evidence quality erodes. The paper defines evidence sufficiency through four dimensions and shows how three kinds of drift drive distinct degradation paths in those dimensions. It supplies a proxy monitoring system built from seven unlabeled measurement categories that estimates the current sufficiency level and flags which drift types remain invisible. Experiments on a large fraud dataset confirm that the proxies catch covariate and mixed drift at 100 percent while missing pure concept drift, and that sufficiency scores fall steadily over time with concept drift causing the steepest drop. The result is a governance instrument that turns drift signals into auditable readiness assessments whose blind spots are explicitly mapped.

Core claim

The paper formalizes an evidence sufficiency model with four dimensions—completeness, freshness, reliability, and representativeness—plus a decision-readiness gate that quantifies how label latency degrades evidence. It maps three drift types to dimension-specific degradation trajectories and introduces a proxy indicator framework of seven measurement categories that estimates sufficiency loss without labels, together with coverage mappings and per-drift blind spots. On the IEEE-CIS Fraud Detection dataset with controlled drift injection, the composite proxy score detects covariate and mixed drift at 100 percent, while concept drift without feature change remains undetected, matching the theoretical impossibility of unsupervised detection when P(X) is unchanged.
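
To make the detection asymmetry concrete, here is a minimal sketch, in Python, of the kind of unlabeled covariate-drift check such proxies can rely on: a per-feature two-sample Kolmogorov-Smirnov test. The function name and threshold are assumptions, not the paper's implementation; the point is that a feature-only test can fire on P(X) shifts but is structurally blind to pure concept drift, where P(Y|X) changes while P(X) does not.

```python
# Illustrative sketch (not the paper's implementation): a per-feature
# two-sample Kolmogorov-Smirnov test is one standard unlabeled proxy for
# covariate drift. The threshold and names here are assumptions.
import numpy as np
from scipy.stats import ks_2samp

def covariate_drift_flags(reference, window, alpha=0.01):
    """Flag each feature column whose distribution shifted relative to
    the reference sample; needs no outcome labels."""
    return [ks_2samp(reference[:, j], window[:, j]).pvalue < alpha
            for j in range(reference.shape[1])]

rng = np.random.default_rng(0)
ref = rng.normal(size=(5000, 3))

# Covariate drift: P(X) changes, so a feature-only test can fire.
print(any(covariate_drift_flags(ref, ref + [0.5, 0.0, 0.0])))  # True

# Pure concept drift: P(Y|X) changes while P(X) stays fixed, so the
# same test sees nothing -- the blind spot the paper characterizes.
print(any(covariate_drift_flags(ref, rng.normal(size=(5000, 3)))))  # usually False
```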

What carries the argument

The four-dimension evidence sufficiency model together with the seven-category proxy indicator framework that converts unlabeled measurements into drift-specific degradation trajectories and auditable readiness scores.
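
A hypothetical sketch of how such a coverage mapping might be wired up. The seven category names and the min-over-dimensions aggregation below are assumptions for illustration; this review does not reproduce the paper's exact definitions.

```python
# Hypothetical wiring of the coverage idea: each unlabeled proxy category
# informs one or more sufficiency dimensions, and the decision-readiness
# gate is taken as the weakest dimension (an assumed aggregation).
import numpy as np

DIMENSIONS = ["completeness", "freshness", "reliability", "representativeness"]

COVERAGE = {  # which dimensions each proxy category informs (illustrative)
    "volume":             ["completeness"],
    "missingness":        ["completeness", "reliability"],
    "label_latency":      ["freshness"],
    "feature_drift":      ["representativeness"],
    "score_distribution": ["representativeness", "reliability"],
    "calibration_proxy":  ["reliability"],
    "population_mix":     ["representativeness", "freshness"],
}

def composite_sufficiency(proxy_scores):
    """Average each dimension over the proxies covering it, then gate on
    the weakest dimension."""
    dim_scores = {
        dim: np.mean([proxy_scores[p] for p, dims in COVERAGE.items()
                      if dim in dims])
        for dim in DIMENSIONS
    }
    return min(dim_scores.values())

scores = {p: 0.9 for p in COVERAGE}
scores["feature_drift"] = 0.4          # a representativeness proxy degrades
print(composite_sufficiency(scores))   # gated by the weakest dimension
```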

If this is right

  • Governance teams can set explicit thresholds for when a risk system remains decision-ready despite pending labels.
  • Drift alerts can be converted into dimension-specific sufficiency reports rather than generic warnings.
  • The framework reveals which drift types require labeled data because they are invisible to unsupervised proxies.
  • Blind-spot mappings allow organizations to add targeted supplementary checks for specific drift patterns.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same proxy approach could be adapted to other delayed-feedback domains such as medical treatment outcomes or long-horizon forecasting.
  • Calibration of sufficiency thresholds to each deployment's risk tolerance would be required before operational use.
  • Combining the proxies with occasional minimal label sampling could shrink the blind spot for concept drift without full retraining (see the sketch after this list).
  • The model offers a way to make existing governance frameworks more sensitive to the timing of evidence rather than only to model accuracy.
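
A hedged sketch of the minimal-label-sampling idea from the third bullet: audit a small random sample of outcomes as their labels arrive and test whether the observed error rate significantly exceeds the training-time baseline. The sample size and the one-sided binomial test are illustrative assumptions, not the paper's method.

```python
# Minimal label sampling to probe the concept-drift blind spot: a small
# audited sample of arrived labels is tested against the baseline error
# rate. All numbers here are illustrative assumptions.
from scipy.stats import binomtest

def concept_drift_audit(n_errors, n_sampled, baseline_error, alpha=0.05):
    """True if the audited error rate significantly exceeds baseline,
    catching P(Y|X) shifts that unlabeled proxies cannot see."""
    p = binomtest(n_errors, n_sampled, baseline_error,
                  alternative="greater").pvalue
    return p < alpha

# e.g. 9 errors in 50 audited outcomes against a 6% baseline error rate
print(concept_drift_audit(9, 50, 0.06))  # True: escalate for review
```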

Load-bearing premise

The seven proxy categories can accurately estimate degradation in all four sufficiency dimensions without access to outcome labels and the drift-to-trajectory mappings match real degradation mechanisms.

What would settle it

A deployment in which the proxy-derived sufficiency score fails to track the actual drop in decision quality once the delayed labels finally arrive, or in which concept drift produces undetected performance loss beyond the theoretical limit.

read the original abstract

Machine learning systems in fraud detection, credit scoring, and clinical risk assessment operate under delayed ground truth: outcome labels arrive days to months after the decision they evaluate. During this blind period, governance evidence degrades through mechanisms that neither drift detection methods nor governance frameworks adequately address. This paper formalizes an evidence sufficiency model with four dimensions (completeness, freshness, reliability, representativeness) and a decision-readiness gate that quantifies how label latency degrades evidence quality. The model maps three drift types to dimension-specific degradation trajectories. A complementary proxy indicator framework comprising seven measurement categories estimates sufficiency degradation without labels, with explicit coverage mapping and characterized blind spots per drift type. Evaluation on the IEEE-CIS Fraud Detection dataset (~590K transactions) with controlled drift injection shows that composite proxy monitoring detects covariate and mixed drift with 100% detection rate, while concept drift without feature change remains undetected -- consistent with the theoretical impossibility of unsupervised detection when P(X) is unchanged. Blind period simulation confirms monotone sufficiency degradation, with concept drift degrading fastest (S=0.242 at day 60 vs 0.418 for no-drift). The framework contributes a governance sufficiency monitoring instrument; its value lies in translating drift signals into auditable sufficiency assessments with characterized blind spots. Mapping sufficiency levels to governance actions requires deployment-specific calibration beyond this study's scope.
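
For intuition about the abstract's blind-period numbers, a small illustration follows. Exponential decay is an assumed functional form, with rates back-fitted so that the day-60 values equal the reported endpoints (S=0.418 no-drift, S=0.242 concept drift); the paper's actual degradation model is not given in this review and may differ.

```python
# Illustration only: monotone sufficiency decay with rates back-fitted to
# the abstract's reported day-60 endpoints. Not the paper's model.
import math

def sufficiency(day, s60):
    """S(t) = exp(-lam * t), with lam chosen so that S(60) = s60."""
    lam = -math.log(s60) / 60.0
    return math.exp(-lam * day)

for day in (0, 15, 30, 45, 60):
    print(f"day {day:2d}  no-drift S={sufficiency(day, 0.418):.3f}"
          f"  concept-drift S={sufficiency(day, 0.242):.3f}")
```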

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper formalizes an evidence sufficiency model with four dimensions (completeness, freshness, reliability, representativeness) and a decision-readiness gate for ML risk systems under delayed ground truth. It maps three drift types to dimension-specific degradation trajectories and introduces a proxy indicator framework with seven measurement categories to estimate sufficiency degradation without labels. Evaluation on the IEEE-CIS Fraud Detection dataset (~590K transactions) with controlled drift injection reports 100% detection for covariate and mixed drift, zero detection for concept drift without feature change, and monotone sufficiency degradation in blind-period simulations (e.g., S=0.242 for concept drift vs. 0.418 for no-drift at day 60). The framework is positioned as a governance monitoring instrument with characterized blind spots.

Significance. If the proxy-to-sufficiency mapping holds, the work provides a useful governance instrument for high-stakes domains by translating drift signals into auditable sufficiency assessments during label latency, with explicit coverage and blind spots. Strengths include the public dataset, controlled drift injection experiments, and reproducible detection rates consistent with theoretical expectations for unsupervised settings.

major comments (2)
  1. [Evaluation on the IEEE-CIS Fraud Detection dataset] Evaluation section: the reported 100% proxy-based detection for covariate/mixed drift and monotone S degradation (S=0.242 vs. 0.418 at day 60) do not include quantitative validation (e.g., correlation, recovery error, or dimension-wise alignment metrics) showing that the seven proxy categories recover or track the four-dimension sufficiency degradation trajectories in the absence of labels. This leaves the central claim that proxies estimate sufficiency degradation as an untested modeling assumption.
  2. [Proxy indicator framework] Proxy indicator framework: the explicit coverage mapping from the seven proxy measurement categories to the four sufficiency dimensions is presented, but the blind-period simulation and drift-injection results do not test whether these proxies produce estimates that correlate with the simulated degradation levels across completeness, freshness, reliability, and representativeness when ground truth is unavailable.
minor comments (2)
  1. [Abstract] Abstract: lacks details on the exact definitions or formulas for the seven proxy categories and the decision-readiness gate.
  2. [Blind period simulation] Blind period simulation: clarify the aggregation formula used to compute the composite sufficiency score S from the four dimensions.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. The feedback correctly identifies a gap in the empirical validation of the proxy-to-sufficiency mapping. We address each major comment below and will revise the manuscript accordingly.

read point-by-point responses
  1. Referee: [Evaluation on the IEEE-CIS Fraud Detection dataset] Evaluation section: the reported 100% proxy-based detection for covariate/mixed drift and monotone S degradation (S=0.242 vs. 0.418 at day 60) do not include quantitative validation (e.g., correlation, recovery error, or dimension-wise alignment metrics) showing that the seven proxy categories recover or track the four-dimension sufficiency degradation trajectories in the absence of labels. This leaves the central claim that proxies estimate sufficiency degradation as an untested modeling assumption.

    Authors: We agree that the current evaluation relies on overall detection rates and aggregate S trajectories without reporting direct quantitative alignment between the seven proxy categories and the four dimension-specific degradation levels. The proxy mapping was constructed from definitional coverage (e.g., volume and missingness proxies for completeness), and the controlled experiments confirm expected detection behavior, but this does not substitute for explicit correlation or recovery metrics. In the revised manuscript we will add, in the Evaluation section, dimension-wise proxy scores, Pearson correlations with simulated degradation per dimension, and mean absolute recovery error for the blind-period simulations on the IEEE-CIS dataset. revision: yes

  2. Referee: [Proxy indicator framework] Proxy indicator framework: the explicit coverage mapping from the seven proxy measurement categories to the four sufficiency dimensions is presented, but the blind-period simulation and drift-injection results do not test whether these proxies produce estimates that correlate with the simulated degradation levels across completeness, freshness, reliability, and representativeness when ground truth is unavailable.

    Authors: The referee is correct that the reported results emphasize composite detection and monotone aggregate S degradation rather than per-dimension correlation tests under label latency. While the coverage mapping is explicit in the framework section, the empirical link to simulated trajectories was not quantified. We will revise the blind-period simulation subsection to include these tests: time-series correlations between each proxy category and the corresponding dimension degradation, plus cross-dimension alignment statistics, using the same controlled drift injections. This will provide direct evidence that the proxies track sufficiency degradation when ground truth is unavailable; a sketch of these alignment metrics follows the responses. revision: yes
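
A minimal sketch of the alignment metrics both responses promise, assuming proxy estimates and simulated ground-truth degradation are available as per-dimension time series; the array layout and names are illustrative, not the manuscript's.

```python
# Per-dimension Pearson correlation and mean absolute recovery error
# between label-free proxy estimates and simulated degradation, each
# given as an (n_timesteps, 4) array with one column per dimension.
import numpy as np
from scipy.stats import pearsonr

DIMS = ["completeness", "freshness", "reliability", "representativeness"]

def alignment_metrics(proxy_est, simulated):
    """Return {dimension: {'pearson_r': r, 'mae': e}} quantifying how
    well the proxies track the simulated trajectories."""
    out = {}
    for j, dim in enumerate(DIMS):
        r, _ = pearsonr(proxy_est[:, j], simulated[:, j])
        mae = float(np.mean(np.abs(proxy_est[:, j] - simulated[:, j])))
        out[dim] = {"pearson_r": r, "mae": mae}
    return out
```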

Circularity Check

0 steps flagged

No circularity: definitions and empirical measurements remain independent

full rationale

The paper introduces an evidence sufficiency model with four dimensions and a proxy indicator framework with seven categories as independent constructs, then evaluates them via controlled drift injection on the external IEEE-CIS Fraud Detection dataset. Reported outcomes (100% detection for covariate/mixed drift, S values at day 60, blind spots for concept drift) are measured simulation results rather than quantities recovered by construction from fitted parameters or self-referential mappings. No load-bearing self-citations, uniqueness theorems, or ansatzes imported from prior author work appear in the derivation; the mapping from proxies to dimensions is presented as an explicit modeling choice with characterized coverage, not as a tautology. The central claims therefore rest on external data and stated assumptions rather than reducing to the inputs by definition.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 2 invented entities

The central claim rests on newly introduced conceptual constructs (the four-dimensional model and proxy framework) whose validity is supported only by the described simulation and dataset evaluation; no external benchmarks or independent derivations are referenced.

axioms (2)
  • domain assumption The four dimensions of completeness, freshness, reliability, and representativeness adequately capture all mechanisms by which label latency degrades evidence quality.
    Invoked when formalizing the evidence sufficiency model and decision-readiness gate.
  • domain assumption Proxy indicators drawn from seven measurement categories can estimate sufficiency degradation without access to delayed labels.
    Basis for the complementary proxy indicator framework and its coverage mapping.
invented entities (2)
  • Evidence sufficiency model with decision-readiness gate no independent evidence
    purpose: Quantifies how label latency degrades evidence quality across four dimensions
    Newly formalized construct; no independent evidence or prior literature reference provided.
  • Proxy indicator framework with seven measurement categories no independent evidence
    purpose: Estimates sufficiency degradation without labels and maps blind spots per drift type
    Introduced as complementary monitoring instrument; no external validation cited.

pith-pipeline@v0.9.0 · 5531 in / 1629 out tokens · 42704 ms · 2026-05-10T08:14:31.417270+00:00 · methodology


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Label-Free Detection of Governance Evidence Degradation in Risk Decision Systems

    cs.CY · 2026-04 · unverdicted · novelty 6.0

    A composite multi-proxy framework detects harmful drift in label-free risk decision systems and enables graduated governance alerts.

Reference graph

Works this paper leans on

47 extracted references · 43 canonical work pages · cited by 1 Pith paper · 1 internal anchor
