Anomaly Detection in Soil Heavy Metal Contamination Using Unsupervised Learning for Environmental Risk Assessment
Pith reviewed 2026-05-07 08:51 UTC · model grok-4.3
The pith
Unsupervised machine learning identifies specific anomalous soil samples with elevated heavy metal risks that standard indices overlook.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors establish that Isolation Forest and PCA reconstruction error each flag 12 anomalous samples (15.4 percent of the total), while a consensus method isolates six robust anomalies (7.7 percent), all located at site S3. These anomalies exhibit 70-80 percent higher mean Hazard Index values than normal samples, with every consensus anomaly exceeding the HI=1 safety threshold. PCA reconstruction error correlates positively with the Hazard Index at r approximately 0.8. The work identifies three distinct anomaly types: extreme copper enrichment at S3, anomalously low nickel at S4 and S5, and moderate lead-zinc co-elevation at S9 through S12.
What carries the argument
Unsupervised anomaly detection algorithms (Isolation Forest, PCA reconstruction error, and DBSCAN) applied to multivariate heavy metal concentration data, cross-checked against Hazard Index and Incremental Lifetime Cancer Risk values.
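The detection-and-consensus step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the data are synthetic, and the log transform, the standardisation, the number of PCA components (3), and the random seed are all hypothetical choices, since the paper's settings are not reported. The scikit-learn classes used (`IsolationForest`, `PCA`, `StandardScaler`) are standard.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a 78 x 8 concentration matrix (NOT the paper's data).
rng = np.random.default_rng(0)
X = rng.lognormal(mean=0.0, sigma=0.5, size=(78, 8))
X[:6, 3] *= 8.0  # hypothetical strong enrichment of one metal in six samples

n_flag = 12  # mirrors the reported count; the actual settings are not stated
Z = StandardScaler().fit_transform(np.log(X))  # preprocessing is an assumption

# Isolation Forest: fit_predict returns -1 for anomalies and 1 for inliers.
iso = IsolationForest(contamination=n_flag / len(Z), random_state=0)
if_flags = set(np.flatnonzero(iso.fit_predict(Z) == -1).tolist())

# PCA reconstruction error: project onto k components, map back,
# and measure the squared residual per sample.
pca = PCA(n_components=3).fit(Z)  # k = 3 is an assumed value
Z_hat = pca.inverse_transform(pca.transform(Z))
recon_error = np.sum((Z - Z_hat) ** 2, axis=1)
pca_flags = set(np.argsort(recon_error)[-n_flag:].tolist())

# Consensus: samples flagged by both detectors.
consensus = if_flags & pca_flags
print(sorted(consensus))
```

Because the consensus is a set intersection, its size depends on how much the two detectors agree, which in turn depends on the assumed hyperparameters.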
Load-bearing premise
The unsupervised algorithms correctly separate genuine contamination anomalies from normal soil variation or sampling artifacts without any labeled ground-truth examples of known high-risk or clean sites.
What would settle it
A follow-up study that obtains independent ground-truth labels, such as verified laboratory re-testing or health outcome data at the flagged anomalous sites versus the non-anomalous sites, would show whether the detected points truly correspond to elevated risks or represent false positives.
Original abstract
Soil contamination by heavy metals poses a persistent environmental and public health concern in rapidly urbanising regions of Ghana, particularly at unregulated waste disposal sites. This study applies an unsupervised machine learning framework to detect and characterise anomalous heavy metal contamination patterns in soils from twelve waste sites and residential controls in the Central Region of Ghana. Concentrations of eight metals (As, Cd, Cr, Cu, Hg, Ni, Pb, Zn) were analysed alongside standard health risk indices, including the Hazard Index (HI) and Incremental Lifetime Cancer Risk (ILCR). Isolation Forest and PCA reconstruction error each identified $12$ anomalous samples ($15.4\%$ of $78$ samples), while DBSCAN detected no density-isolated noise points. A consensus approach isolated six robust anomalies ($7.7\%$), all spatially concentrated at a single site (S3). Anomalies exhibited approximately $70$--$80\%$ higher mean HI values than normal samples, with all consensus anomalies exceeding the HI$=1$ threshold. PCA reconstruction error showed a strong positive association with HI ($r \approx 0.8$), indicating consistency between multivariate deviation and health risk. Three distinct anomaly types were identified: extreme Cu enrichment at S3, anomalously low Ni at S4/S5, and moderate multi-metal (Pb--Zn) co-elevation at S9--S12. The results demonstrate that unsupervised machine learning provides granular, objective insight beyond aggregate indices, enabling targeted site prioritisation and risk-informed environmental management.
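For readers unfamiliar with the HI = 1 threshold mentioned in the abstract, the sketch below shows the conventional form of the Hazard Index: the sum over metals of hazard quotients, each a daily dose divided by a reference dose. It is assumed, not confirmed, that the paper follows this convention. The function name `hazard_index` is hypothetical, and the dose and reference-dose numbers are illustrative placeholders rather than values from the study.

```python
from typing import Mapping


def hazard_index(dose: Mapping[str, float], rfd: Mapping[str, float]) -> float:
    """Sum of hazard quotients HQ_i = dose_i / RfD_i over the metals supplied."""
    return sum(dose[metal] / rfd[metal] for metal in dose)


# Placeholder values in mg/kg/day, for illustration only (not study data).
dose = {"Cu": 0.02, "Pb": 0.0007}
rfd = {"Cu": 0.04, "Pb": 0.0035}

hi = hazard_index(dose, rfd)
exceeds_threshold = hi > 1.0  # HI > 1 is the usual flag for non-carcinogenic risk
print(round(hi, 3), exceeds_threshold)
```

Since HI is a scalar sum, two samples with very different metal profiles can share the same HI, which is the gap the multivariate methods are meant to address.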
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript applies unsupervised anomaly detection (Isolation Forest, PCA reconstruction error, DBSCAN) to concentrations of eight heavy metals in 78 soil samples from 12 waste and control sites in Ghana's Central Region. It reports 12 anomalies each from Isolation Forest and PCA (15.4%), none from DBSCAN, a consensus of 6 robust anomalies (7.7%) all at site S3, ~70-80% higher mean HI for anomalies (all exceeding HI=1), a correlation r≈0.8 between PCA error and HI, and three post-hoc anomaly types (extreme Cu enrichment, low Ni, moderate Pb-Zn elevation). The central claim is that this framework supplies granular, objective insight beyond standard aggregate health risk indices (HI, ILCR) for targeted site prioritisation and risk-informed management.
Significance. If the detected anomalies can be shown to correspond to genuine contamination signals rather than artifacts, the work offers a practical demonstration of multivariate methods for environmental monitoring in regions with limited labeled data. Strengths include the multi-algorithm consensus approach and direct linkage to pre-existing health indices; the methods are standard and the dataset size (78 samples) is modest but usable. No machine-checked proofs or open code are mentioned, but the approach could support reproducible follow-up if parameters and data are released.
major comments (2)
- [Results/Discussion] Results/Discussion: The claim that unsupervised ML 'provides granular, objective insight beyond aggregate indices' is load-bearing for the paper's contribution but rests on the untested assumption that the anomalies isolate real signals. The reported r≈0.8 between PCA reconstruction error and HI is unsurprising because both quantities are deterministic functions of the identical eight metal concentrations; without labeled ground truth, external site records, hold-out validation, or comparison to known contamination events, the correlation does not establish added value over the indices themselves.
- [Methods] Methods: The consensus step that reduces 12 anomalies to 6 'robust' ones is presented without justification or sensitivity analysis (e.g., how the intersection is defined, robustness to parameter choice). This directly affects the spatial concentration claim (all at S3) and the subsequent anomaly-type interpretation.
minor comments (3)
- [Abstract/Results] Abstract and Results: No error bars, standard deviations, or statistical tests are reported for the 70-80% higher mean HI comparison or the r≈0.8 correlation, making it impossible to assess whether the differences are significant given the small sample size.
- [Methods] Methods: Hyperparameters for Isolation Forest (contamination fraction), PCA (number of retained components), and DBSCAN (eps, min_samples) are not stated, preventing exact reproduction of the reported anomaly counts.
- [Results] The three post-hoc anomaly types are described qualitatively; a table or figure showing the actual concentration values or loadings for the consensus anomalies would improve clarity.
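The missing uncertainty estimates raised in the first minor comment could be supplied with resampling methods, which make few distributional assumptions and suit a sample of 78. The sketch below is one possible approach, not the authors' analysis: it computes a percentile bootstrap interval for Pearson's r and a one-sided permutation p-value for a difference in group means. All data are synthetic, and the function names (`pearson_r`, `bootstrap_ci`, `permutation_p`) are hypothetical.

```python
import random


def _mean(values):
    return sum(values) / len(values)


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = _mean(x), _mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def bootstrap_ci(x, y, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for r, resampling (x, y) pairs together."""
    rng = random.Random(seed)
    n = len(x)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        xs, ys = [x[i] for i in idx], [y[i] for i in idx]
        if len(set(xs)) > 1 and len(set(ys)) > 1:  # skip degenerate resamples
            stats.append(pearson_r(xs, ys))
    stats.sort()
    lo = stats[int(alpha / 2 * len(stats))]
    hi = stats[int((1 - alpha / 2) * len(stats)) - 1]
    return lo, hi


def permutation_p(a, b, n_perm=2000, seed=0):
    """One-sided p-value for mean(a) > mean(b) under random relabelling."""
    rng = random.Random(seed)
    observed = _mean(a) - _mean(b)
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if _mean(pooled[:len(a)]) - _mean(pooled[len(a):]) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


# Synthetic stand-ins for reconstruction error and HI (NOT the paper's data).
rng = random.Random(42)
recon_error = [rng.expovariate(1.0) for _ in range(78)]
hazard_idx = [0.4 + 0.3 * e + rng.gauss(0.0, 0.2) for e in recon_error]

r = pearson_r(recon_error, hazard_idx)
ci_low, ci_high = bootstrap_ci(recon_error, hazard_idx)

# Treat the 12 samples with the largest error as "anomalous" for illustration.
order = sorted(range(78), key=lambda i: recon_error[i])
anomalous = [hazard_idx[i] for i in order[-12:]]
normal = [hazard_idx[i] for i in order[:-12]]
p_value = permutation_p(anomalous, normal)
print(r, (ci_low, ci_high), p_value)
```

One caveat applies to any such test here: the anomalous group is selected using the same concentrations that determine HI, so a small p-value would not by itself answer the referee's first major comment.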
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed comments, which have prompted us to clarify our claims and strengthen the methodological transparency of the manuscript. We address each major comment below and have incorporated revisions to moderate language, add justification, and include sensitivity analysis where feasible.
Point-by-point responses
-
Referee: [Results/Discussion] Results/Discussion: The claim that unsupervised ML 'provides granular, objective insight beyond aggregate indices' is load-bearing for the paper's contribution but rests on the untested assumption that the anomalies isolate real signals. The reported r≈0.8 between PCA reconstruction error and HI is unsurprising because both quantities are deterministic functions of the identical eight metal concentrations; without labeled ground truth, external site records, hold-out validation, or comparison to known contamination events, the correlation does not establish added value over the indices themselves.
Authors: We agree that the correlation (r ≈ 0.8) between PCA reconstruction error and HI is expected, given that both are functions of the same eight metal concentrations. The manuscript's contribution lies in demonstrating how multivariate unsupervised methods can surface specific patterns that are not directly visible from scalar aggregate indices alone, such as the three distinct anomaly types (extreme Cu enrichment, low Ni, and moderate Pb-Zn elevation) and the exclusive concentration of the consensus anomalies at site S3. These patterns support targeted prioritization even when the anomaly scores align with HI overall. However, we acknowledge that without ground truth the stronger phrasing of 'insight beyond' cannot be fully substantiated. In the revised manuscript we will replace this language in the abstract and discussion with 'complements standard health risk indices through multivariate pattern detection and spatial granularity.' We will also insert a dedicated limitations paragraph noting the lack of external validation and recommending future studies with labeled or historical contamination data. revision: yes
-
Referee: [Methods] Methods: The consensus step that reduces 12 anomalies to 6 'robust' ones is presented without justification or sensitivity analysis (e.g., how the intersection is defined, robustness to parameter choice). This directly affects the spatial concentration claim (all at S3) and the subsequent anomaly-type interpretation.
Authors: The consensus set was formed by intersecting the 12 anomalies identified by Isolation Forest with the 12 identified by PCA reconstruction error; DBSCAN returned zero anomalies and therefore did not enter the intersection. This choice was intended to retain only those samples flagged as outliers by two independent paradigms (isolation-based and reconstruction-based). We will revise the Methods section to state this definition explicitly and to explain the rationale for requiring agreement across algorithms. In addition, we will perform and report a sensitivity analysis by varying the Isolation Forest contamination fraction (0.10–0.20), the number of retained PCA components (2–5), and DBSCAN’s epsilon and min_samples. The analysis will show that the core consensus of six samples remains stable and continues to be located exclusively at S3 under plausible parameter ranges; these results will be added to the main text or as a supplementary table. revision: yes
- Obtaining labeled ground truth, external site records, hold-out validation sets, or comparisons to documented contamination events, as these data were not collected or available within the original study design.
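The sensitivity analysis promised in the second response could take roughly the following shape. This is a sketch under stated assumptions, not the authors' analysis. To stay dependency-light it covers only the PCA detector: it varies the number of retained components over 2 to 5 and a flagged fraction over 0.10 to 0.20, where the flagged fraction stands in for the Isolation Forest contamination setting, which the sketch does not exercise. The data are synthetic, and the helper names `pca_recon_error` and `flagged` are hypothetical.

```python
import itertools

import numpy as np


def pca_recon_error(Z, k):
    """Squared residual after projecting centred data onto the top-k principal axes."""
    Zc = Z - Z.mean(axis=0)
    _, _, Vt = np.linalg.svd(Zc, full_matrices=False)
    V = Vt[:k].T  # columns are the top-k principal directions
    return np.sum((Zc - Zc @ V @ V.T) ** 2, axis=1)


def flagged(Z, k, frac):
    """Indices of the samples with the largest reconstruction error."""
    n_flag = int(round(frac * len(Z)))
    err = pca_recon_error(Z, k)
    return set(np.argsort(err)[-n_flag:].tolist())


# Synthetic standardised data (NOT the paper's data).
rng = np.random.default_rng(1)
Z = rng.standard_normal((78, 8))
Z[:6, 3] += 6.0  # hypothetical shift in one variable for six samples

grid = list(itertools.product(range(2, 6), (0.10, 0.15, 0.20)))
sets = {setting: flagged(Z, *setting) for setting in grid}

# Samples flagged under every setting form the stable core.
stable_core = set.intersection(*sets.values())

# Jaccard overlap of each setting with an assumed baseline (k = 3, 15 percent).
baseline = sets[(3, 0.15)]
jaccard = {s: len(f & baseline) / len(f | baseline) for s, f in sets.items()}
print(sorted(stable_core))
```

Reporting the stable core alongside per-setting Jaccard overlaps would show directly whether a claim such as "all consensus anomalies lie at S3" survives reasonable parameter changes.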
Circularity Check
No circularity: empirical application of standard algorithms to measured data
full rationale
The paper applies off-the-shelf unsupervised methods (Isolation Forest, PCA reconstruction error, DBSCAN) to the eight measured metal concentrations and reports empirical associations with pre-computed HI/ILCR values. No equation, parameter fit, or self-citation is shown to reduce any reported result (anomaly counts, r≈0.8, anomaly types) to the inputs by construction. The correlation between PCA error and HI is an observed statistical relationship between two independently computed functions of the same raw data, not a definitional equivalence or fitted prediction. The central claim of 'granular insight beyond aggregate indices' is an interpretive conclusion from the empirical outputs rather than a load-bearing derivation that collapses to self-reference. The work is therefore self-contained as a standard data-analysis study.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Unsupervised anomaly detection algorithms can reliably distinguish contamination signals from background variation in soil metal data.