Machine learning applied to emerald gemstone grading: framework proposal and creation of a public dataset
Pith reviewed 2026-05-25 04:25 UTC · model grok-4.3
The pith
A machine learning framework automates emerald gemstone grading by matching stones to reference images and reaches 98 percent accuracy.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The framework acquires images in a dedicated chamber, extracts features, and applies machine learning classification to categorize emeralds according to reference stones. It reports 98% accuracy on a dataset of 192 images, outperforming a deep learning approach, and the dataset is made public.
What carries the argument
Image acquisition chamber combined with extracted and pre-processed features fed into a machine learning classifier for matching to reference stones.
Load-bearing premise
The image acquisition chamber and the extracted features encode the same grading criteria that human specialists use when comparing stones to references.
What would settle it
Testing the framework on a fresh set of emeralds graded by multiple independent specialists and finding frequent mismatches with the majority human consensus.
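One way to make this test concrete is sketched below in Python. It takes several specialists' grades per stone, forms the most common grade as the consensus, and reports how often the framework's prediction disagrees with it. The function names, the grade labels, and the tie rule (stones without a clear consensus are skipped) are illustrative assumptions, not part of the paper.

```python
from collections import Counter


def majority_label(grades):
    """Return the most common grade, or None when the top grades are tied."""
    if not grades:
        return None
    (top, top_count), *rest = Counter(grades).most_common()
    if rest and rest[0][1] == top_count:
        return None  # no single grade wins outright
    return top


def mismatch_rate(predictions, specialist_grades):
    """Fraction of stones whose prediction differs from the consensus grade.

    Stones without a clear consensus are excluded (an assumption of this sketch).
    """
    compared = mismatches = 0
    for predicted, grades in zip(predictions, specialist_grades):
        consensus = majority_label(grades)
        if consensus is None:
            continue
        compared += 1
        mismatches += predicted != consensus
    return mismatches / compared if compared else float("nan")


# Hypothetical data: four stones, three specialists each.
predictions = ["A", "B", "C", "A"]
specialist_grades = [
    ["A", "A", "B"],
    ["B", "C", "C"],
    ["C", "C", "C"],
    ["A", "B", "C"],  # three-way tie, skipped
]
rate = mismatch_rate(predictions, specialist_grades)
print(f"mismatch rate against consensus: {rate:.2f}")
```

A high mismatch rate on stones where the specialists largely agree would count against the load-bearing premise; mismatches on stones where specialists themselves disagree are less informative.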
Original abstract
The grading of gemstones is currently a manual procedure performed by gemologists. A popular approach uses reference stones, where those are visually inspected by specialists that decide which one of the available reference stone is the most similar to the inspected stone. This procedure is very subjective as different specialists may end up with different grading choices. This work proposes a complete framework that entails the image acquisition and goes up to the final stone categorization. The proposal is able to automate the entire process apart from including the stone in the created chamber for the image acquisition. It discards the subjective decisions made by specialists. This is the first work to propose a machine learning approach coupled with image processing techniques for emerald grading. The proposed framework achieves 98% of accuracy (correctly categorized stones), outperforming a deep learning approach. Furthermore, we also create and publish the used dataset that contains 192 images of emerald stones along with their extracted and pre-processed features.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes an end-to-end machine-learning framework for emerald gemstone grading that combines a custom image-acquisition chamber, hand-crafted feature extraction, and a classifier. It reports 98% accuracy on a newly created public dataset of 192 images and claims to outperform a deep-learning baseline while removing human subjectivity from the reference-stone matching process.
Significance. A reproducible, objective grading system for emeralds would address a long-standing practical problem in gemology. The release of the 192-image dataset with extracted features is a concrete contribution that could enable future benchmarking. However, the absence of any reported validation protocol, label provenance, or statistical controls means the 98% figure cannot yet be treated as evidence that the framework encodes the same criteria used by specialists.
major comments (3)
- [Abstract / Results] Abstract and Results section: the central claim of 98% accuracy (and superiority over deep learning) is presented without any description of the train-test split, cross-validation procedure, number of folds, or error bars. On a dataset of only 192 images this information is load-bearing for interpreting whether the result reflects generalization or overfitting.
- [Dataset / Methods] Dataset creation paragraph: although the text acknowledges that emerald grading is subjective and that specialists may disagree on reference-stone matches, no information is supplied on how the ground-truth labels for the 192 images were produced (single grader, consensus of several, or reference to an external standard) and no inter-rater agreement statistic is reported. High label noise would render both the 98% figure and the deep-learning comparison difficult to interpret.
- [Results] Comparison to deep learning: the manuscript states that the proposed framework outperforms a deep-learning approach, yet supplies no details on the architecture, training regime, data augmentation, or hyper-parameter search used for the baseline. Without these specifics the performance comparison cannot be evaluated.
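To illustrate why the evaluation protocol matters at this sample size, the Python sketch below computes a Wilson score interval for an accuracy estimate. It is a minimal illustration rather than the authors' procedure: the counts are hypothetical, since the manuscript summary does not say how many images the 98% figure was measured on.

```python
import math


def wilson_interval(correct, total, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives roughly 95% coverage)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    denom = 1 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return centre - half, centre + half


# Hypothetical scenarios: accuracy near 98% measured on all 192 images,
# or on a held-out fifth of them (38 images).
full = wilson_interval(188, 192)
held_out = wilson_interval(37, 38)
print(f"192 images: {full[0]:.3f} to {full[1]:.3f}")
print(f" 38 images: {held_out[0]:.3f} to {held_out[1]:.3f}")
```

The interval widens as the evaluation set shrinks, which is why the split, the number of folds, and the resulting error bars need to be reported alongside the headline figure.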
minor comments (2)
- [Abstract] The abstract claims the framework 'discards the subjective decisions made by specialists,' but the system still relies on human-labeled training data; this tension should be clarified.
- [Figures / Tables] Figure captions and table headings should explicitly state the number of images per class and the exact feature dimensionality after preprocessing.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed comments. We address each major point below and will revise the manuscript to incorporate the requested information where it is currently absent.
Point-by-point responses
- Referee: [Abstract / Results] Abstract and Results section: the central claim of 98% accuracy (and superiority over deep learning) is presented without any description of the train-test split, cross-validation procedure, number of folds, or error bars. On a dataset of only 192 images this information is load-bearing for interpreting whether the result reflects generalization or overfitting.
Authors: We agree that the validation protocol must be described explicitly. The revised manuscript will add a clear account of the train-test split used, whether cross-validation was performed and with how many folds, and any error bars or confidence intervals accompanying the 98% accuracy figure. This will allow readers to assess generalization versus potential overfitting on the small dataset. revision: yes
- Referee: [Dataset / Methods] Dataset creation paragraph: although the text acknowledges that emerald grading is subjective and that specialists may disagree on reference-stone matches, no information is supplied on how the ground-truth labels for the 192 images were produced (single grader, consensus of several, or reference to an external standard) and no inter-rater agreement statistic is reported. High label noise would render both the 98% figure and the deep-learning comparison difficult to interpret.
Authors: We will revise the Dataset section to specify exactly how the ground-truth labels were assigned (including the number of graders and the procedure followed) and will report any available inter-rater agreement statistics or explicitly discuss the limitations of the labeling process used. This addresses the concern about potential label noise. revision: yes
- Referee: [Results] Comparison to deep learning: the manuscript states that the proposed framework outperforms a deep-learning approach, yet supplies no details on the architecture, training regime, data augmentation, or hyper-parameter search used for the baseline. Without these specifics the performance comparison cannot be evaluated.
Authors: The revised Results section will include complete specifications of the deep-learning baseline: the network architecture, training regime, data augmentation strategies, and hyper-parameter search procedure. These details will make the performance comparison reproducible and evaluable. revision: yes
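An inter-rater agreement statistic of the kind discussed in the second response could be computed as in the Python sketch below, which implements Cohen's kappa for two graders. The grader label lists are invented for illustration. The paper's actual labeling procedure is not described here, and more than two graders would call for a different statistic such as Fleiss' kappa.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa: chance-corrected agreement between two graders."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: share of stones given the same category by both graders.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if each grader assigned categories independently
    # at their own observed rates.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both graders used one identical category throughout
    return (observed - expected) / (1 - expected)


# Hypothetical grades from two specialists for six stones.
grader_1 = ["1", "1", "2", "2", "3", "3"]
grader_2 = ["1", "1", "2", "3", "3", "3"]
kappa = cohen_kappa(grader_1, grader_2)
print(f"Cohen's kappa: {kappa:.2f}")
```

Values near 1 indicate agreement well beyond chance, while values near 0 indicate agreement no better than chance, which would suggest that the labels themselves are noisy.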
Circularity Check
No significant circularity; standard empirical ML evaluation on created dataset.
Full rationale
The paper describes an image acquisition chamber, feature extraction, ML model training on a new 192-image emerald dataset, and reports classification accuracy against ground-truth labels. No derivation chain, equations, or self-citations are presented that reduce the accuracy claim to a fitted input by construction. The 98% figure is the direct output of supervised training and evaluation on the authors' own data, which is the expected reporting format for such work rather than a tautological redefinition. No load-bearing self-citation, ansatz smuggling, or uniqueness theorem is invoked. The central claim remains an empirical result on the provided dataset.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Visual features extracted from controlled images are sufficient to represent the grading criteria used by gemologists.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "The proposed framework achieves 98% of accuracy (correctly categorized stones), outperforming a deep learning approach... features f5–f20: Histogram comparison using the Bhattacharyya distance... GLCM homogeneity and entropy"
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, reality_from_one_distinction (unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "We captured a total of 192 emerald stones, 24 images belonging to each of the 8 categories"
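The first quoted passage mentions histogram comparison using the Bhattacharyya distance. As a rough illustration of that measure, the Python sketch below compares two histograms using the common definition D = -ln(sum_i sqrt(p_i q_i)) on normalized bins. The bin counts are made up, and the paper may use a different variant or normalization.

```python
import math


def bhattacharyya_distance(hist_p, hist_q):
    """Bhattacharyya distance between two histograms with the same bins."""
    if len(hist_p) != len(hist_q):
        raise ValueError("histograms must have the same number of bins")
    total_p, total_q = sum(hist_p), sum(hist_q)
    if total_p <= 0 or total_q <= 0:
        raise ValueError("histograms must have positive total mass")
    # Bhattacharyya coefficient on the normalized histograms (1 means identical).
    coefficient = sum(
        math.sqrt((p / total_p) * (q / total_q)) for p, q in zip(hist_p, hist_q)
    )
    coefficient = min(coefficient, 1.0)  # guard against floating-point overshoot
    if coefficient == 0.0:
        return math.inf  # no overlapping bins
    return max(0.0, -math.log(coefficient))


# Hypothetical 4-bin colour histograms for a stone and a reference stone.
stone = [10, 40, 30, 20]
reference = [12, 35, 33, 20]
print(f"distance: {bhattacharyya_distance(stone, reference):.4f}")
```

A distance of 0 means the normalized histograms coincide, and larger values mean less overlap, so a stone would be matched to the reference with the smallest distance under this measure.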
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.