arxiv: 2604.27102 · v1 · submitted 2026-04-29 · 💻 cs.LG · cs.AI · physics.data-an · physics.geo-ph

Recognition: unknown

Anomaly Detection in Soil Heavy Metal Contamination Using Unsupervised Learning for Environmental Risk Assessment

Isaac Tettey Adjokatse, Samuel Senyo Koranteng, George Yamoah Afrifa, Theophilus Ansah-Narh, Marcellin Atemkeng, Joseph Bremang Tandoh, Kow Ahor Essel-Yorke, Richmond Opoku-Sarkodie, Rebecca Davis


Pith reviewed 2026-05-07 08:51 UTC · model grok-4.3


keywords anomaly detection · unsupervised learning · heavy metal contamination · soil pollution · environmental risk assessment · health risk indices · Ghana waste sites · machine learning application


The pith

Unsupervised machine learning identifies specific anomalous soil samples with elevated heavy metal risks that standard indices overlook.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper applies unsupervised machine learning to soil samples collected from waste sites and residential areas in Ghana's Central Region. It processes concentrations of eight heavy metals to flag unusual patterns across 78 samples. The flagged anomalies show substantially higher health risk scores than the rest of the data and concentrate at one particular site. This approach distinguishes multiple types of contamination signatures and shows consistency with conventional risk calculations. It supplies a more detailed basis for deciding where to focus environmental management efforts.

Core claim

The authors establish that Isolation Forest and PCA reconstruction error each flag 12 anomalous samples (15.4 percent of the total), while a consensus method isolates six robust anomalies (7.7 percent), all located at site S3. These anomalies exhibit 70-80 percent higher mean Hazard Index values than normal samples, with every consensus anomaly exceeding the HI=1 safety threshold. PCA reconstruction error correlates positively with the Hazard Index at r approximately 0.8. The work identifies three distinct anomaly types: extreme copper enrichment at S3, anomalously low nickel at S4 and S5, and moderate lead-zinc co-elevation at S9 through S12.
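The proportions quoted above follow directly from the sample counts, and a short check makes the arithmetic explicit. The `relative_uplift` helper and the two group means passed to it are illustrative assumptions, not values from the paper; they only show how a "70-80 percent higher mean" figure would be computed.

```python
# Sanity check of the proportions quoted in the core claim.
n_samples = 78
n_flagged_per_method = 12  # Isolation Forest and PCA error each flag 12
n_consensus = 6            # samples flagged by both methods

pct_per_method = round(100 * n_flagged_per_method / n_samples, 1)
pct_consensus = round(100 * n_consensus / n_samples, 1)


def relative_uplift(mean_anomalous, mean_normal):
    """Percent by which the anomalous-group mean exceeds the normal-group mean."""
    return 100 * (mean_anomalous - mean_normal) / mean_normal


# Hypothetical group means, chosen only to illustrate the calculation.
example_uplift = relative_uplift(mean_anomalous=1.75, mean_normal=1.0)

print(pct_per_method, pct_consensus, example_uplift)
```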

What carries the argument

Unsupervised anomaly detection algorithms (Isolation Forest, PCA reconstruction error, and DBSCAN) applied to multivariate heavy metal concentration data, cross-checked against Hazard Index and Incremental Lifetime Cancer Risk values.
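As a rough illustration of one of these detectors, the sketch below computes a PCA reconstruction error per sample with NumPy. It is not the authors' pipeline: the function name, the choice of three retained components, the synthetic log-normal data, and the rule of flagging the 12 largest errors are all assumptions made for the example.

```python
import numpy as np


def pca_reconstruction_error(X, n_components):
    """Per-sample squared error after projecting onto the top principal components."""
    # Standardise each metal so high-concentration elements do not dominate.
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)
    # Rows of Vt are principal directions, ordered by explained variance.
    _, _, Vt = np.linalg.svd(Xs, full_matrices=False)
    V = Vt[:n_components].T
    X_hat = Xs @ V @ V.T  # project onto the subspace and map back
    return np.sum((Xs - X_hat) ** 2, axis=1)


rng = np.random.default_rng(0)
# Synthetic stand-in for the real data: 78 samples x 8 metals.
X = rng.lognormal(mean=0.0, sigma=0.5, size=(78, 8))

errors = pca_reconstruction_error(X, n_components=3)  # 3 is an assumed value

# Assumed flagging rule: the 12 samples with the largest error (about 15.4%).
n_flag = 12
flagged = np.argsort(errors)[-n_flag:]
print(sorted(flagged.tolist()))
```

A sample scores high when its metal profile is poorly described by the dominant correlation structure, which is why the number of retained components matters for which samples are flagged.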

Load-bearing premise

The unsupervised algorithms correctly separate genuine contamination anomalies from normal soil variation or sampling artifacts without any labeled ground-truth examples of known high-risk or clean sites.

What would settle it

A follow-up study that obtains independent ground-truth labels, such as verified laboratory re-testing or health outcome data at the flagged anomalous sites versus the non-anomalous sites, would show whether the detected points truly correspond to elevated risks or represent false positives.
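If such labels were obtained, the comparison could be as simple as precision and recall of the flagged set against the confirmed set. The sketch below assumes a hypothetical `precision_recall` helper and made-up sample identifiers; no ground-truth labels exist in the study as summarised here.

```python
def precision_recall(flagged, confirmed):
    """Precision and recall of flagged sample IDs against confirmed contaminated IDs."""
    flagged, confirmed = set(flagged), set(confirmed)
    true_positives = len(flagged & confirmed)
    precision = true_positives / len(flagged) if flagged else 0.0
    recall = true_positives / len(confirmed) if confirmed else 0.0
    return precision, recall


# Hypothetical identifiers, for illustration only.
flagged = {"S3-1", "S3-2", "S3-3", "S4-1"}
confirmed = {"S3-1", "S3-2", "S9-1"}

precision, recall = precision_recall(flagged, confirmed)
print(precision, recall)
```

Low precision would indicate that flagged points are largely false positives; low recall would indicate that confirmed contamination is being missed.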

Original abstract

Soil contamination by heavy metals poses a persistent environmental and public health concern in rapidly urbanising regions of Ghana, particularly at unregulated waste disposal sites. This study applies an unsupervised machine learning framework to detect and characterise anomalous heavy metal contamination patterns in soils from twelve waste sites and residential controls in the Central Region of Ghana. Concentrations of eight metals (As, Cd, Cr, Cu, Hg, Ni, Pb, Zn) were analysed alongside standard health risk indices, including the Hazard Index (HI) and Incremental Lifetime Cancer Risk (ILCR). Isolation Forest and PCA reconstruction error each identified $12$ anomalous samples ($15.4\%$ of $78$ samples), while DBSCAN detected no density-isolated noise points. A consensus approach isolated six robust anomalies ($7.7\%$), all spatially concentrated at a single site (S3). Anomalies exhibited approximately $70$--$80\%$ higher mean HI values than normal samples, with all consensus anomalies exceeding the HI$=1$ threshold. PCA reconstruction error showed a strong positive association with HI ($r \approx 0.8$), indicating consistency between multivariate deviation and health risk. Three distinct anomaly types were identified: extreme Cu enrichment at S3, anomalously low Ni at S4/S5, and moderate multi-metal (Pb--Zn) co-elevation at S9--S12. The results demonstrate that unsupervised machine learning provides granular, objective insight beyond aggregate indices, enabling targeted site prioritisation and risk-informed environmental management.
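The abstract uses HI and ILCR without defining them. Under the conventional risk-assessment formulation, assumed here because the paper's exposure pathways and constants are not given in this summary, a hazard quotient is a daily dose divided by a reference dose, HI is the sum of hazard quotients, and ILCR is a dose multiplied by a cancer slope factor. Every number below is a placeholder chosen for easy arithmetic, not a regulatory value or a measurement from the study.

```python
def hazard_index(daily_dose, reference_dose):
    """Sum of hazard quotients (dose / reference dose) over all metals."""
    return sum(daily_dose[m] / reference_dose[m] for m in daily_dose)


def incremental_cancer_risk(daily_dose, slope_factor):
    """Sum of dose x slope factor over metals that have a slope factor."""
    return sum(daily_dose[m] * slope_factor[m] for m in slope_factor)


# Placeholder doses and reference doses in mg/kg/day (illustrative only).
dose = {"Cu": 0.03, "Ni": 0.004}
rfd = {"Cu": 0.02, "Ni": 0.02}
# Placeholder slope factor in (mg/kg/day)^-1 (illustrative only).
slope = {"Ni": 0.5}

hi = hazard_index(dose, rfd)
ilcr = incremental_cancer_risk(dose, slope)
exceeds_threshold = hi > 1.0  # HI = 1 is the threshold cited in the abstract

print(hi, ilcr, exceeds_threshold)
```

Because HI is a weighted sum of the same concentrations fed to the detectors, some agreement between HI and any multivariate deviation score is to be expected.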

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Applies standard unsupervised methods to a new Ghana soil dataset and links anomalies to health indices, but the lack of ground truth makes the 'beyond aggregate indices' claim hard to sustain.


This paper collects concentrations for eight heavy metals from 78 soil samples at twelve waste and control sites in Ghana's Central Region, then runs Isolation Forest, PCA reconstruction error, and DBSCAN to flag anomalies. It reports that a consensus set of six samples, all from one site (S3), shows 70-80% higher mean Hazard Index values and that PCA error correlates with HI at r ≈ 0.8. Three post-hoc anomaly patterns (high Cu, low Ni, moderate Pb-Zn) are described as well.

The dataset from unregulated sites is the concrete contribution, and the spatial concentration of flagged points gives a clear, usable output for local prioritisation. The basic pipeline is executed without obvious errors in the reported numbers.

The main limitation is validation. No labeled ground truth, external site records, or hold-out checks are described, so it is difficult to tell whether the anomalies capture real contamination signals or simply reflect natural variation and sampling effects in a modest-sized set. The correlation with HI is unsurprising because both quantities are deterministic functions of the same metal concentrations. The consensus step and the three anomaly types therefore remain interpretive rather than independently corroborated.

This is mainly for applied environmental researchers or local agencies in Ghana and similar regions who need a screening example that combines multivariate methods with existing risk indices. A reader looking for new algorithms or general principles will not find them.

I would send it for peer review. The regional data and straightforward application are worth referee time, provided the authors can address the validation gap and tone down the stronger claims about objective insight.

Referee Report

2 major / 3 minor

Summary. The manuscript applies unsupervised anomaly detection (Isolation Forest, PCA reconstruction error, DBSCAN) to concentrations of eight heavy metals in 78 soil samples from 12 waste and control sites in Ghana's Central Region. It reports 12 anomalies each from Isolation Forest and PCA (15.4%), none from DBSCAN, a consensus of 6 robust anomalies (7.7%) all at site S3, ~70-80% higher mean HI for anomalies (all exceeding HI=1), a correlation r≈0.8 between PCA error and HI, and three post-hoc anomaly types (extreme Cu enrichment, low Ni, moderate Pb-Zn elevation). The central claim is that this framework supplies granular, objective insight beyond standard aggregate health risk indices (HI, ILCR) for targeted site prioritisation and risk-informed management.

Significance. If the detected anomalies can be shown to correspond to genuine contamination signals rather than artifacts, the work offers a practical demonstration of multivariate methods for environmental monitoring in regions with limited labeled data. Strengths include the multi-algorithm consensus approach and direct linkage to pre-existing health indices; the methods are standard and the dataset size (78 samples) is modest but usable. No machine-checked proofs or open code are mentioned, but the approach could support reproducible follow-up if parameters and data are released.

major comments (2)

[Results/Discussion] The claim that unsupervised ML 'provides granular, objective insight beyond aggregate indices' is load-bearing for the paper's contribution but rests on the untested assumption that the anomalies isolate real signals. The reported r≈0.8 between PCA reconstruction error and HI is unsurprising because both quantities are deterministic functions of the identical eight metal concentrations; without labeled ground truth, external site records, hold-out validation, or comparison to known contamination events, the correlation does not establish added value over the indices themselves.
[Methods] The consensus step that reduces 12 anomalies to 6 'robust' ones is presented without justification or sensitivity analysis (e.g., how the intersection is defined, robustness to parameter choice). This directly affects the spatial concentration claim (all at S3) and the subsequent anomaly-type interpretation.
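To make the second comment concrete, the sketch below shows how much the "consensus" set depends on the voting rule. The `consensus` helper and the sample labels are hypothetical; the only detail taken from the manuscript summary is that DBSCAN flagged no points.

```python
from collections import Counter


def consensus(flag_sets, min_votes):
    """Return sample IDs flagged by at least `min_votes` of the detectors."""
    votes = Counter(s for flags in flag_sets for s in set(flags))
    return {s for s, v in votes.items() if v >= min_votes}


# Hypothetical flag sets, for illustration only.
iforest_flags = {"a", "b", "c", "d"}
pca_flags = {"c", "d", "e"}
dbscan_flags = set()  # DBSCAN reported no noise points

strict = consensus([iforest_flags, pca_flags], min_votes=2)  # intersection
loose = consensus([iforest_flags, pca_flags], min_votes=1)   # union
all_three = consensus([iforest_flags, pca_flags, dbscan_flags], min_votes=3)

print(strict, loose, all_three)
```

Requiring agreement from all three detectors empties the set whenever DBSCAN flags nothing, so the manuscript needs to state which detectors vote and how many votes are required.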

minor comments (3)

[Abstract/Results] No error bars, standard deviations, or statistical tests are reported for the 70-80% higher mean HI comparison or the r≈0.8 correlation, making it impossible to assess whether the differences are significant given the small sample size.
[Methods] Hyperparameters for Isolation Forest (contamination fraction), PCA (number of retained components), and DBSCAN (eps, min_samples) are not stated, preventing exact reproduction of the reported anomaly counts.
[Results] The three post-hoc anomaly types are described qualitatively; a table or figure showing the actual concentration values or loadings for the consensus anomalies would improve clarity.
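One inexpensive way to address the first minor comment is a permutation test on the difference in mean HI between flagged and unflagged samples. The sketch below is an assumption about how that could be done, not something the manuscript reports; the function name and the HI values are invented for illustration.

```python
import random
from statistics import mean


def permutation_p_value(group_a, group_b, n_permutations=10_000, seed=0):
    """One-sided p-value for mean(group_a) > mean(group_b) under label shuffling."""
    rng = random.Random(seed)
    observed = mean(group_a) - mean(group_b)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        if mean(pooled[:n_a]) - mean(pooled[n_a:]) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_permutations + 1)


# Hypothetical HI values, for illustration only.
hi_anomalous = [1.6, 1.9, 1.7, 2.1, 1.8, 1.5]
hi_normal = [0.9, 1.1, 1.0, 0.8, 1.2, 1.0, 0.95, 1.05, 0.85, 1.15]

p_value = permutation_p_value(hi_anomalous, hi_normal, n_permutations=2000)
print(p_value)
```

A caveat applies: because the detectors and HI use the same concentrations, a small p-value would show that the groups differ in HI, not that the detectors add independent information.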

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive and detailed comments, which have prompted us to clarify our claims and strengthen the methodological transparency of the manuscript. We address each major comment below and have incorporated revisions to moderate language, add justification, and include sensitivity analysis where feasible.


Referee: [Results/Discussion] The claim that unsupervised ML 'provides granular, objective insight beyond aggregate indices' is load-bearing for the paper's contribution but rests on the untested assumption that the anomalies isolate real signals. The reported r≈0.8 between PCA reconstruction error and HI is unsurprising because both quantities are deterministic functions of the identical eight metal concentrations; without labeled ground truth, external site records, hold-out validation, or comparison to known contamination events, the correlation does not establish added value over the indices themselves.

Authors: We agree that the correlation (r ≈ 0.8) between PCA reconstruction error and HI is expected, given that both are functions of the same eight metal concentrations. The manuscript's contribution lies in demonstrating how multivariate unsupervised methods can surface specific patterns that are not directly visible from scalar aggregate indices alone, such as the three distinct anomaly types (extreme Cu enrichment, low Ni, and moderate Pb-Zn elevation) and the exclusive concentration of the consensus anomalies at site S3. These patterns support targeted prioritization even when overall risk alignment with HI is observed. However, we acknowledge that without ground truth the stronger phrasing of 'insight beyond' cannot be fully substantiated. In the revised manuscript we will replace this language in the abstract and discussion with 'complements standard health risk indices through multivariate pattern detection and spatial granularity.' We will also insert a dedicated limitations paragraph noting the lack of external validation and recommending future studies with labeled or historical contamination data. revision: yes
Referee: [Methods] The consensus step that reduces 12 anomalies to 6 'robust' ones is presented without justification or sensitivity analysis (e.g., how the intersection is defined, robustness to parameter choice). This directly affects the spatial concentration claim (all at S3) and the subsequent anomaly-type interpretation.

Authors: The consensus set was formed by intersecting the 12 anomalies identified by Isolation Forest with the 12 identified by PCA reconstruction error; DBSCAN returned zero anomalies and therefore did not enter the intersection. This choice was intended to retain only those samples flagged as outliers by two independent paradigms (isolation-based and reconstruction-based). We will revise the Methods section to state this definition explicitly and to explain the rationale for requiring agreement across algorithms. In addition, we will perform and report a sensitivity analysis by varying the Isolation Forest contamination fraction (0.10–0.20), the number of retained PCA components (2–5), and DBSCAN’s epsilon and min_samples. The analysis will show that the core consensus of six samples remains stable and continues to be located exclusively at S3 under plausible parameter ranges; these results will be added to the main text or as a supplementary table. revision: yes
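A sensitivity analysis of the kind promised here could be sketched with scikit-learn as follows. This is an assumed setup rather than the authors' code: the data are synthetic, the `consensus_flags` helper is hypothetical, PCA flags are taken as the top fraction of reconstruction errors matching the contamination rate, and the DBSCAN sweep is omitted for brevity.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import StandardScaler


def consensus_flags(X, contamination, n_components, seed=0):
    """Indices flagged by both Isolation Forest and PCA reconstruction error."""
    Xs = StandardScaler().fit_transform(X)
    n_flag = int(round(contamination * len(Xs)))

    # Isolation Forest marks anomalies with the label -1.
    iforest = IsolationForest(contamination=contamination, random_state=seed)
    if_flags = set(np.flatnonzero(iforest.fit_predict(Xs) == -1).tolist())

    # PCA: flag the n_flag samples with the largest reconstruction error.
    pca = PCA(n_components=n_components).fit(Xs)
    recon = pca.inverse_transform(pca.transform(Xs))
    errors = np.sum((Xs - recon) ** 2, axis=1)
    pca_flags = set(np.argsort(errors)[-n_flag:].tolist())

    return if_flags & pca_flags


rng = np.random.default_rng(0)
X = rng.lognormal(sigma=0.5, size=(78, 8))  # synthetic stand-in data

# Grid spanning the ranges named in the response above.
grid = {
    (c, k): consensus_flags(X, contamination=c, n_components=k)
    for c in (0.10, 0.15, 0.20)
    for k in (2, 3, 4, 5)
}

# Samples that survive every setting form the parameter-stable core.
stable_core = set.intersection(*grid.values())
print({key: len(flags) for key, flags in grid.items()}, sorted(stable_core))
```

Reporting the per-setting counts alongside the stable core would let a reader judge whether the six-sample consensus at S3 depends on one particular parameter choice.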

standing simulated objections not resolved

Labeled ground truth, external site records, hold-out validation sets, and comparisons to documented contamination events cannot be provided, as these data were not collected or available within the original study design.

Circularity Check

0 steps flagged

No circularity: empirical application of standard algorithms to measured data


The paper applies off-the-shelf unsupervised methods (Isolation Forest, PCA reconstruction error, DBSCAN) to the eight measured metal concentrations and reports empirical associations with pre-computed HI/ILCR values. No equation, parameter fit, or self-citation is shown to reduce any reported result (anomaly counts, r≈0.8, anomaly types) to the inputs by construction. The correlation between PCA error and HI is an observed statistical relationship between two independently computed functions of the same raw data, not a definitional equivalence or fitted prediction. The central claim of 'granular insight beyond aggregate indices' is an interpretive conclusion from the empirical outputs rather than a load-bearing derivation that collapses to self-reference. The work is therefore self-contained as a standard data-analysis study.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that standard unsupervised algorithms will surface meaningful contamination patterns when applied to this small environmental dataset; no new entities or fitted parameters are introduced beyond routine algorithm hyperparameters.

axioms (1)

domain assumption: Unsupervised anomaly detection algorithms can reliably distinguish contamination signals from background variation in soil metal data.
Invoked when interpreting Isolation Forest and PCA outputs as true anomalies rather than artifacts.

pith-pipeline@v0.9.0 · 5631 in / 1151 out tokens · 62357 ms · 2026-05-07T08:51:35.476096+00:00 · methodology

