Machine learning applied to emerald gemstone grading: framework proposal and creation of a public dataset

D Crabi; Érick O Rodrigues; FB Pena; G Bernardes; Sandro C Izidoro

arxiv: 2605.23777 · v1 · pith:7EQWCUKQnew · submitted 2026-05-22 · 💻 cs.CV

Pith reviewed 2026-05-25 04:25 UTC · model grok-4.3

classification 💻 cs.CV

keywords emerald grading · gemstone classification · machine learning · image processing · public dataset · computer vision · classification accuracy

The pith

A machine learning framework automates emerald gemstone grading by matching stones to reference images and reaches 98 percent accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to replace subjective manual grading of emeralds by gemologists with an automated system that uses image processing and machine learning. It builds a complete pipeline from capturing images in a controlled chamber to categorizing stones based on extracted features. A sympathetic reader would care because this could make grading consistent and objective, reducing disagreements between specialists. The work also provides the first public dataset for this task, enabling further research.

Core claim

The framework uses image acquisition in a dedicated chamber followed by feature extraction and machine learning classification to categorize emeralds according to reference stones, achieving 98% accuracy on the dataset of 192 images, which outperforms a deep learning approach, and the dataset is made public.

What carries the argument

Image acquisition chamber combined with extracted and pre-processed features fed into a machine learning classifier for matching to reference stones.

Load-bearing premise

The image acquisition chamber and the extracted features encode the same grading criteria that human specialists use when comparing stones to references.

What would settle it

Testing the framework on a fresh set of emeralds graded by multiple independent specialists and finding frequent mismatches with the majority human consensus.

Original abstract

The grading of gemstones is currently a manual procedure performed by gemologists. A popular approach uses reference stones, where those are visually inspected by specialists that decide which one of the available reference stone is the most similar to the inspected stone. This procedure is very subjective as different specialists may end up with different grading choices. This work proposes a complete framework that entails the image acquisition and goes up to the final stone categorization. The proposal is able to automate the entire process apart from including the stone in the created chamber for the image acquisition. It discards the subjective decisions made by specialists. This is the first work to propose a machine learning approach coupled with image processing techniques for emerald grading. The proposed framework achieves 98% of accuracy (correctly categorized stones), outperforming a deep learning approach. Furthermore, we also create and publish the used dataset that contains 192 images of emerald stones along with their extracted and pre-processed features.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper's real contribution is the public 192-image emerald dataset; the 98% accuracy claim rests on thin validation details that the abstract does not supply.

The main takeaway is that this work releases a new public dataset of emerald images and applies standard image processing plus classification to automate a narrow grading task. That dataset is the part that could actually be used by others. The framework itself follows the usual pipeline of chamber-based capture, feature extraction, and a classifier, with a reported 98% accuracy that beats their deep-learning baseline. Releasing the data with pre-processed features is a concrete step forward for anyone who wants to experiment in this specific domain without building the collection from scratch.

The authors correctly note that manual grading is subjective, which sets up the motivation for an automated system. The methods are not novel in computer vision terms, but the domain application and data release are new relative to the cited literature.

The soft spots sit in the evaluation. The abstract gives no information on how the ground-truth labels were produced, whether multiple graders were involved, or what inter-rater agreement looked like. Given the acknowledged subjectivity, that missing detail makes the 98% figure difficult to interpret as evidence that the system captures the same criteria specialists use. Train-test splits, cross-validation, and error bars are also not described, so the performance comparison cannot be checked from the given text. The deep-learning baseline is mentioned without training details.

This paper is for researchers in applied computer vision who work on specialized inspection problems or for gemology groups that might adopt consistency tools. A reader looking for a ready-to-use small dataset in a commercial niche would find it useful. The central argument holds up as an applied demonstration, but the missing validation steps are the main limitation.

I would send it for peer review because the dataset is new and the task is clearly scoped, even though the methods section will need expansion and the accuracy claims will require more scrutiny.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes an end-to-end machine-learning framework for emerald gemstone grading that combines a custom image-acquisition chamber, hand-crafted feature extraction, and a classifier. It reports 98% accuracy on a newly created public dataset of 192 images and claims to outperform a deep-learning baseline while removing human subjectivity from the reference-stone matching process.

Significance. A reproducible, objective grading system for emeralds would address a long-standing practical problem in gemology. The release of the 192-image dataset with extracted features is a concrete contribution that could enable future benchmarking. However, the absence of any reported validation protocol, label provenance, or statistical controls means the 98% figure cannot yet be treated as evidence that the framework encodes the same criteria used by specialists.
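
To illustrate why statistical controls matter at this sample size, the sketch below computes a Wilson score interval for a single accuracy figure. It assumes, purely for illustration, that 98% corresponds to 188 of 192 images classified correctly in one evaluation; the manuscript's actual protocol is not stated, so that count and the helper name `wilson_interval` are hypothetical.

```python
import math
from typing import Tuple


def wilson_interval(correct: int, total: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 is roughly 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p_hat = correct / total
    denom = 1.0 + z * z / total
    center = (p_hat + z * z / (2 * total)) / denom
    half_width = z * math.sqrt(
        p_hat * (1.0 - p_hat) / total + z * z / (4 * total * total)
    ) / denom
    return center - half_width, center + half_width


# ASSUMPTION: the count of correct predictions is hypothetical, chosen only
# so that correct / total is close to the reported 98% on 192 images.
n_images = 192
n_correct = round(0.98 * n_images)

low, high = wilson_interval(n_correct, n_images)
print(f"accuracy = {n_correct / n_images:.3f}, interval = [{low:.3f}, {high:.3f}]")
```

Under these assumptions the interval is a few percentage points wide, which is why a comparison against a baseline on the same 192 images is hard to read without error bars.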

major comments (3)

[Abstract / Results] Abstract and Results section: the central claim of 98% accuracy (and superiority over deep learning) is presented without any description of the train-test split, cross-validation procedure, number of folds, or error bars. On a dataset of only 192 images this information is load-bearing for interpreting whether the result reflects generalization or overfitting.
[Dataset / Methods] Dataset creation paragraph: although the text acknowledges that emerald grading is subjective and that specialists may disagree on reference-stone matches, no information is supplied on how the ground-truth labels for the 192 images were produced (single grader, consensus of several, or reference to an external standard) and no inter-rater agreement statistic is reported. High label noise would render both the 98% figure and the deep-learning comparison difficult to interpret.
[Results] Comparison to deep learning: the manuscript states that the proposed framework outperforms a deep-learning approach, yet supplies no details on the architecture, training regime, data augmentation, or hyper-parameter search used for the baseline. Without these specifics the performance comparison cannot be evaluated.
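
The validation protocol requested in the first major comment could look roughly like the following stratified k-fold sketch. It is not the authors' procedure, which is not described. The class layout (8 categories with 24 images each) comes from the paper passage quoted elsewhere on this page; the fold count, the function names, and the `dummy_evaluate` stand-in are assumptions made for illustration.

```python
import random
from collections import defaultdict
from statistics import mean, stdev


def stratified_folds(labels, k, seed=0):
    """Split sample indices into k folds with similar class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(k)]
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Deal each class round-robin so every fold gets a similar share.
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds


def cross_validate(labels, k, evaluate_fold, seed=0):
    """Return one score per fold; evaluate_fold(train_idx, test_idx) -> accuracy."""
    folds = stratified_folds(labels, k, seed)
    scores = []
    for i, test_idx in enumerate(folds):
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        scores.append(evaluate_fold(train_idx, test_idx))
    return scores


# 8 categories x 24 images, as stated in the quoted paper passage.
labels = [category for category in range(8) for _ in range(24)]


# HYPOTHETICAL stand-in for "fit a classifier on train_idx, score on test_idx".
# It only checks that the two index sets do not overlap.
def dummy_evaluate(train_idx, test_idx):
    return 1.0 if not set(train_idx) & set(test_idx) else 0.0


scores = cross_validate(labels, k=6, evaluate_fold=dummy_evaluate)
print(f"{mean(scores):.3f} +/- {stdev(scores):.3f} over {len(scores)} folds")
```

Reporting the per-fold mean and spread, rather than a single number, is the kind of detail that would let a reader judge generalization on a dataset this small.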

minor comments (2)

[Abstract] The abstract claims the framework 'discards the subjective decisions made by specialists,' but the system still relies on human-labeled training data; this tension should be clarified.
[Figures / Tables] Figure captions and table headings should explicitly state the number of images per class and the exact feature dimensionality after preprocessing.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major point below and will revise the manuscript to incorporate the requested information where it is currently absent.

Referee: [Abstract / Results] Abstract and Results section: the central claim of 98% accuracy (and superiority over deep learning) is presented without any description of the train-test split, cross-validation procedure, number of folds, or error bars. On a dataset of only 192 images this information is load-bearing for interpreting whether the result reflects generalization or overfitting.

Authors: We agree that the validation protocol must be described explicitly. The revised manuscript will add a clear account of the train-test split used, whether cross-validation was performed and with how many folds, and any error bars or confidence intervals accompanying the 98% accuracy figure. This will allow readers to assess generalization versus potential overfitting on the small dataset.
revision: yes

Referee: [Dataset / Methods] Dataset creation paragraph: although the text acknowledges that emerald grading is subjective and that specialists may disagree on reference-stone matches, no information is supplied on how the ground-truth labels for the 192 images were produced (single grader, consensus of several, or reference to an external standard) and no inter-rater agreement statistic is reported. High label noise would render both the 98% figure and the deep-learning comparison difficult to interpret.

Authors: We will revise the Dataset section to specify exactly how the ground-truth labels were assigned (including the number of graders and the procedure followed) and will report any available inter-rater agreement statistics or explicitly discuss the limitations of the labeling process used. This addresses the concern about potential label noise.
revision: yes

Referee: [Results] Comparison to deep learning: the manuscript states that the proposed framework outperforms a deep-learning approach, yet supplies no details on the architecture, training regime, data augmentation, or hyper-parameter search used for the baseline. Without these specifics the performance comparison cannot be evaluated.

Authors: The revised Results section will include complete specifications of the deep-learning baseline: the network architecture, training regime, data augmentation strategies, and hyper-parameter search procedure. These details will make the performance comparison reproducible and evaluable.
revision: yes
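
As an illustration of the inter-rater agreement statistic raised in the second point, Cohen's kappa for two graders can be sketched as below. Nothing on this page says how many graders labeled the stones or which statistic the authors would use, so the two-grader setup, the grade labels, and the function name `cohen_kappa` are assumptions; with more than two graders a multi-rater statistic such as Fleiss' kappa is the usual choice.

```python
from collections import Counter


def cohen_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two raters who labeled the same items."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("need two non-empty rating lists of equal length")
    n = len(ratings_a)
    # Observed agreement: fraction of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    freq_a, freq_b = Counter(ratings_a), Counter(ratings_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label for every item
    return (observed - expected) / (1.0 - expected)


# HYPOTHETICAL grades for four stones from two graders; not data from the paper.
grader_a = ["A", "A", "B", "B"]
grader_b = ["A", "B", "B", "B"]
print(cohen_kappa(grader_a, grader_b))
```

A kappa well below 1 on the ground-truth labels would mean that part of any reported classification error, or success, reflects disagreement between graders rather than the classifier itself.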

Circularity Check

0 steps flagged

No significant circularity; standard empirical ML evaluation on created dataset.

The paper describes an image acquisition chamber, feature extraction, ML model training on a new 192-image emerald dataset, and reports classification accuracy against ground-truth labels. No derivation chain, equations, or self-citations are presented that reduce the accuracy claim to a fitted input by construction. The 98% figure is the direct output of supervised training and evaluation on the authors' own data, which is the expected reporting format for such work rather than a tautological redefinition. No load-bearing self-citation, ansatz smuggling, or uniqueness theorem is invoked. The central claim remains an empirical result on the provided dataset.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that image features can substitute for expert visual comparison and on standard supervised-learning assumptions that the training data distribution matches future stones.

axioms (1)

domain assumption: Visual features extracted from controlled images are sufficient to represent the grading criteria used by gemologists.
The framework depends on this mapping between pixel data and subjective grade labels.

pith-pipeline@v0.9.0 · 5704 in / 1158 out tokens · 22623 ms · 2026-05-25T04:25:21.645014+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

The proposed framework achieves 98% of accuracy (correctly categorized stones), outperforming a deep learning approach... features f5–f20: Histogram comparison using the Bhattacharyya distance... GLCM homogeneity and entropy

IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

We captured a total of 192 emerald stones, 24 images belonging to each of the 8 categories
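
The histogram-comparison features mentioned in the first quoted passage can be illustrated with a small sketch of the Bhattacharyya distance between two histograms. This uses the textbook form, the negative logarithm of the Bhattacharyya coefficient; the paper's exact variant, bin counts, and colour channels are not given on this page, and some libraries use a different normalisation, so the function name and the example histograms below are hypothetical.

```python
import math


def bhattacharyya_distance(hist_p, hist_q):
    """Bhattacharyya distance -ln(BC) between two histograms with equal bins."""
    if len(hist_p) != len(hist_q):
        raise ValueError("histograms must have the same number of bins")
    total_p, total_q = sum(hist_p), sum(hist_q)
    if total_p <= 0 or total_q <= 0:
        raise ValueError("histograms must have positive total mass")
    # Bhattacharyya coefficient of the two normalised histograms.
    coefficient = sum(
        math.sqrt((p / total_p) * (q / total_q)) for p, q in zip(hist_p, hist_q)
    )
    coefficient = min(1.0, coefficient)  # guard against rounding just above 1
    if coefficient == 0.0:
        return math.inf  # no overlapping bins at all
    return max(0.0, -math.log(coefficient))


# HYPOTHETICAL 4-bin histograms for a reference stone and an inspected stone.
reference_hist = [12, 30, 45, 13]
candidate_hist = [10, 28, 50, 12]
print(bhattacharyya_distance(reference_hist, candidate_hist))
```

A distance near zero means the two histograms are nearly identical in shape, so a feature of this kind would be small when an inspected stone resembles a given reference stone.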

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
