Image Thresholding: Understanding Bias of Evaluation Metrics towards Specific Evaluation Functions
Pith reviewed 2026-06-29 17:53 UTC · model grok-4.3
The pith
SSIM and PSNR correlate more strongly with Otsu's between-class variance than with Kapur's entropy when both are measured across all thresholds on BSDS500 images.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
When Otsu's between-class variance and Kapur's entropy are computed for every threshold on each BSDS500 image and correlated with the corresponding SSIM and PSNR scores, Otsu's values correlate strongly with both metrics, while Kapur's values correlate more weakly and more variably. Otsu's correlation exceeds Kapur's on every image for PSNR and on more than 91 percent of images for SSIM. The paper reads this pattern as an inherent bias of the quality metrics toward one objective function over the other.
What carries the argument
Pearson correlation computed between objective-function values and quality-metric values across every possible threshold for each BSDS500 image; the resulting per-image correlation coefficients are then compared between Otsu and Kapur.
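The procedure described above can be sketched for the bi-level case on a single 8-bit grayscale image. This is a minimal illustration, not the paper's code: the function names are hypothetical, only PSNR is computed (SSIM would need an external library), and the segmented image is assumed to be built by replacing every pixel with the mean of its class, which the paper may or may not do.

```python
import numpy as np

LEVELS = np.arange(256)

def histogram_probs(img):
    """Normalized 256-bin histogram of an 8-bit grayscale image."""
    hist = np.bincount(img.ravel(), minlength=256).astype(float)
    return hist / hist.sum()

def class_stats(p, t):
    """Weights and means of the classes [0..t] and [t+1..255]."""
    w0, w1 = p[:t + 1].sum(), p[t + 1:].sum()
    if w0 == 0 or w1 == 0:
        return None  # one class is empty, objectives are undefined
    mu0 = (LEVELS[:t + 1] * p[:t + 1]).sum() / w0
    mu1 = (LEVELS[t + 1:] * p[t + 1:]).sum() / w1
    return w0, w1, mu0, mu1

def otsu_value(p, t):
    """Otsu's between-class variance at threshold t."""
    stats = class_stats(p, t)
    if stats is None:
        return np.nan
    w0, w1, mu0, mu1 = stats
    return w0 * w1 * (mu0 - mu1) ** 2

def kapur_value(p, t):
    """Kapur's entropy: sum of the entropies of the two class distributions."""
    stats = class_stats(p, t)
    if stats is None:
        return np.nan
    w0, w1 = stats[0], stats[1]
    def entropy(q):
        q = q[q > 0]
        return -(q * np.log(q)).sum()
    return entropy(p[:t + 1] / w0) + entropy(p[t + 1:] / w1)

def psnr_value(img, p, t):
    """PSNR of the two-level image (ASSUMPTION: class-mean reconstruction)."""
    stats = class_stats(p, t)
    if stats is None:
        return np.nan
    _, _, mu0, mu1 = stats
    recon = np.where(img <= t, mu0, mu1)
    mse = np.mean((img.astype(float) - recon) ** 2)
    return np.inf if mse == 0 else 10 * np.log10(255.0 ** 2 / mse)

def pearson(x, y):
    """Pearson's r over the thresholds where both series are finite."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    ok = np.isfinite(x) & np.isfinite(y)
    return float(np.corrcoef(x[ok], y[ok])[0, 1])

def correlation_profile(img):
    """Correlate each objective with PSNR across all thresholds of one image."""
    p = histogram_probs(img)
    ts = range(255)  # t = 255 would leave the upper class empty
    otsu = [otsu_value(p, t) for t in ts]
    kapur = [kapur_value(p, t) for t in ts]
    psnr = [psnr_value(img, p, t) for t in ts]
    return {"otsu_vs_psnr": pearson(otsu, psnr),
            "kapur_vs_psnr": pearson(kapur, psnr)}
```

Under the class-mean assumption, the mean squared error equals the total variance minus the between-class variance, so PSNR rises monotonically with Otsu's criterion in this sketch. A different reconstruction rule would not give that identity, so the numbers here should not be read as a reproduction of the paper's results.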
If this is right
- Any ranking of thresholding algorithms that relies on SSIM or PSNR will tend to favor Otsu-based methods over Kapur-based methods even when segmentation quality is equivalent.
- Metaheuristic comparisons published with SSIM or PSNR as the sole arbiter already embed the detected preference.
- Claims of superiority for a new thresholding criterion become harder to interpret without first checking its correlation profile against the same metrics.
- The bias finding applies directly to the common practice of optimizing multilevel thresholds with metaheuristics and then reporting SSIM or PSNR gains.
Where Pith is reading between the lines
- If the bias holds, then papers that introduce new objective functions may need to report performance under multiple independent quality measures rather than SSIM or PSNR alone.
- The same correlation-analysis approach could be applied to other segmentation tasks where objective functions are evaluated via perceptual metrics.
- A practical next step would be to test whether the correlation gap persists when the quality metrics are replaced by direct overlap measures against human ground-truth segmentations.
Load-bearing premise
That the observed difference in correlation strength on the BSDS500 set demonstrates a general, intrinsic bias of the metrics rather than an artifact of the chosen images or the exhaustive-threshold computation method.
What would settle it
Repeating the same exhaustive-threshold correlation analysis on a different large image collection and obtaining either comparable correlations for both objectives or reversed rankings would contradict the claim of metric bias.
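The replication test described above comes down to comparing per-image correlation coefficients between the two objectives on each collection. A rough sketch of that comparison follows; the function name and the example values are hypothetical and are not results from the paper.

```python
import numpy as np

def win_rate(r_first, r_second):
    """Fraction of images on which the first objective's correlation
    with a quality metric exceeds the second objective's."""
    r_first = np.asarray(r_first, dtype=float)
    r_second = np.asarray(r_second, dtype=float)
    if r_first.shape != r_second.shape:
        raise ValueError("need one coefficient per image for each objective")
    return float(np.mean(r_first > r_second))

# Hypothetical per-image coefficients for a three-image collection.
otsu_r = [0.9, 0.8, 0.2]
kapur_r = [0.5, 0.85, 0.1]
rate = win_rate(otsu_r, kapur_r)  # Otsu wins on 2 of 3 images
```

A win rate near 0.5 on a new collection, or one well below 0.5, would be the kind of outcome that contradicts the bias claim.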
Original abstract
Multilevel image thresholding is widely used for segmentation in applications ranging from medical imaging to remote sensing. Classical objective functions, such as Otsu's between-class variance and Kapur's entropy, are often optimized using metaheuristic algorithms, with performance evaluated via metrics like Structural Similarity Index (SSIM) and Peak Signal-to-Noise Ratio (PSNR). These evaluations implicitly assume that SSIM and PSNR provide unbiased measures of segmentation quality. In this study, we examine this assumption by analyzing the correlation between thresholding objective functions and quality metrics across all possible thresholds for images in the BSDS500 dataset. Results show that Otsu's criterion consistently exhibits high correlation with both SSIM and PSNR, while Kapur's entropy demonstrates weaker and more variable correlation. Otsu outperforms Kapur in correlation with PSNR for all images and with SSIM for over 91%. Our findings reveal an inherent metric-objective-function bias. This work highlights the need for more neutral evaluation frameworks and motivates extending the analysis to additional thresholding criteria and domains. Source code of this paper can be found at https://w3id.org/met-dp/icpr26-95
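For reference, the quantities named in the abstract have the following standard bi-level forms. The notation is an assumption made here for illustration: p_i is the normalized gray-level histogram, L is the number of gray levels (256 for 8-bit images), and t is the threshold. The paper's own notation and its multilevel generalization may differ.

```latex
\begin{align*}
\omega_0(t) &= \sum_{i=0}^{t} p_i, &
\omega_1(t) &= \sum_{i=t+1}^{L-1} p_i \\
\mu_0(t) &= \frac{1}{\omega_0(t)} \sum_{i=0}^{t} i\,p_i, &
\mu_1(t) &= \frac{1}{\omega_1(t)} \sum_{i=t+1}^{L-1} i\,p_i \\
\sigma_B^2(t) &= \omega_0(t)\,\omega_1(t)\,\bigl(\mu_0(t) - \mu_1(t)\bigr)^2
  && \text{(Otsu)} \\
H(t) &= -\sum_{i=0}^{t} \frac{p_i}{\omega_0(t)} \ln \frac{p_i}{\omega_0(t)}
        -\sum_{i=t+1}^{L-1} \frac{p_i}{\omega_1(t)} \ln \frac{p_i}{\omega_1(t)}
  && \text{(Kapur)} \\
\mathrm{PSNR} &= 10 \log_{10} \frac{(L-1)^2}{\mathrm{MSE}}
\end{align*}
```

Both objectives are maximized over t. MSE is the mean squared error between the original image and the thresholded image.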
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript claims that an analysis of correlations between thresholding objective functions (Otsu's between-class variance and Kapur's entropy) and image quality metrics (SSIM, PSNR) across all thresholds on the BSDS500 dataset reveals that Otsu's criterion has higher and more consistent correlations, indicating an inherent bias of the evaluation metrics towards Otsu's function over Kapur's.
Significance. If the central claim is supported by appropriate experiments, this would be of moderate significance to the computer vision community, as it questions the neutrality of standard evaluation practices for multilevel thresholding algorithms optimized by metaheuristics. The provision of source code is a strength that aids reproducibility.
major comments (3)
- [Abstract] The reported correlations are computed across all possible thresholds rather than on the results of optimizing the objective functions. This approach does not establish whether the bias affects the evaluation of actual thresholding results obtained from metaheuristic optimization, which is the typical use case described in the introduction.
- [Abstract] No information is given on the specific correlation coefficient used (e.g., Pearson's r), the handling of multilevel thresholding cases, or any statistical significance testing or correction for multiple comparisons across the 500 images.
- [Abstract] The conclusion of 'inherent metric-objective-function bias' is drawn from differential correlations, but the paper does not demonstrate that this leads to different conclusions when SSIM/PSNR are used to compare optimized segmentations from different objectives.
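The gap raised in the first and third major comments can be made concrete with a small check: take the threshold each objective would select and compare the quality metric at those two points, rather than correlating across all thresholds. The sketch below uses an exhaustive argmax as a stand-in for a metaheuristic, which is an assumption that only makes sense for the bi-level case. The function name and the input arrays are hypothetical.

```python
import numpy as np

def compare_at_optima(otsu_vals, kapur_vals, metric_vals):
    """Compare a quality metric at the thresholds chosen by each objective.

    Each argument holds one value per candidate threshold, in the same order.
    NaN entries in the objectives (undefined thresholds) are ignored.
    """
    otsu_vals = np.asarray(otsu_vals, dtype=float)
    kapur_vals = np.asarray(kapur_vals, dtype=float)
    metric_vals = np.asarray(metric_vals, dtype=float)
    t_otsu = int(np.nanargmax(otsu_vals))    # threshold Otsu would pick
    t_kapur = int(np.nanargmax(kapur_vals))  # threshold Kapur would pick
    m_otsu = float(metric_vals[t_otsu])
    m_kapur = float(metric_vals[t_kapur])
    if m_otsu > m_kapur:
        prefers = "otsu"
    elif m_kapur > m_otsu:
        prefers = "kapur"
    else:
        prefers = "tie"
    return {"t_otsu": t_otsu, "t_kapur": t_kapur,
            "metric_at_otsu": m_otsu, "metric_at_kapur": m_kapur,
            "metric_prefers": prefers}
```

Counting how often `metric_prefers` is `"otsu"` across a dataset would show whether the correlation gap changes the ranking of optimized results, which is the question the referee says the manuscript leaves open.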
minor comments (1)
- The abstract mentions 'multilevel image thresholding' but the analysis description focuses on 'all possible thresholds,' which typically applies to bi-level; clarify the scope.
Simulated Author's Rebuttal
We thank the referee for their constructive comments on our manuscript. We address each major comment point by point below, providing clarifications and committing to revisions where appropriate to strengthen the work.
- Referee: [Abstract] The reported correlations are computed across all possible thresholds rather than on the results of optimizing the objective functions. This approach does not establish whether the bias affects the evaluation of actual thresholding results obtained from metaheuristic optimization, which is the typical use case described in the introduction.
Authors: Our choice to compute correlations across all possible thresholds provides an exhaustive, optimization-algorithm-independent view of the relationship between the objective functions and the quality metrics. This isolates the inherent alignment without introducing variability from specific metaheuristics, directly supporting the claim of metric-objective bias. We agree, however, that explicit validation on results from metaheuristic optimization would better connect to the typical use case and will add such experiments in the revision. revision: yes
- Referee: [Abstract] No information is given on the specific correlation coefficient used (e.g., Pearson's r), the handling of multilevel thresholding cases, or any statistical significance testing or correction for multiple comparisons across the 500 images.
Authors: We used Pearson's correlation coefficient throughout. Multilevel cases were handled by enumerating all valid threshold combinations up to the maximum level. We will expand the abstract and methods section to explicitly state the correlation coefficient, detail the multilevel enumeration procedure, and report statistical significance testing with multiple-comparison correction (e.g., Bonferroni) across the 500 images. revision: yes
- Referee: [Abstract] The conclusion of 'inherent metric-objective-function bias' is drawn from differential correlations, but the paper does not demonstrate that this leads to different conclusions when SSIM/PSNR are used to compare optimized segmentations from different objectives.
Authors: The consistently higher correlations for Otsu imply that SSIM and PSNR would systematically favor Otsu-optimized results over Kapur-optimized ones, but we acknowledge that the manuscript does not include explicit optimization runs and subsequent ranking comparisons to illustrate divergent conclusions. We will add metaheuristic optimization experiments and direct comparison of evaluation outcomes in the revised manuscript to substantiate this implication. revision: yes
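The second response above commits to significance testing with a multiple-comparison correction such as Bonferroni across the 500 images. A minimal sketch of that correction is given below. It assumes the per-image p-values have already been obtained from whatever test the authors choose; the function name and the example values are hypothetical.

```python
def bonferroni(p_values, alpha=0.05):
    """Bonferroni correction for a family of hypothesis tests.

    Each p-value is multiplied by the number of tests and capped at 1.
    A test is rejected when its adjusted p-value is at most alpha.
    """
    m = len(p_values)
    adjusted = [min(1.0, p * m) for p in p_values]
    reject = [p_adj <= alpha for p_adj in adjusted]
    return adjusted, reject

# Hypothetical p-values for three images.
adjusted, reject = bonferroni([0.01, 0.04, 0.00001])
```

With 500 images the per-test threshold becomes 0.05 / 500 = 0.0001, which is strict. A less conservative procedure could be substituted if the authors prefer one.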
Circularity Check
No circularity: direct empirical correlations on BSDS500 data.
The paper reports Pearson correlations computed between raw objective-function values (Otsu's between-class variance, Kapur's entropy) and SSIM/PSNR for every threshold on BSDS500 images. This is a direct, data-driven measurement: it involves no fitted parameters presented as predictions, no self-definitional loops, and no load-bearing self-citations. No derivation chain reduces to its own inputs by construction, so the result stands as an independent empirical observation.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: higher correlation between an objective function and a quality metric indicates that the metric is biased toward that objective function.