pith. sign in

arxiv: 2605.29446 · v1 · pith:QAC2NGDAnew · submitted 2026-05-28 · 💻 cs.AI

CrystalXRD-Bench: Benchmarking Vision-Language Models for XRD Peak Indexing Across Diverse Crystalline Materials

Pith reviewed 2026-06-29 07:25 UTC · model grok-4.3

classification 💻 cs.AI
keywords XRD peak indexingMiller indicesvision-language modelsbenchmarkcrystallographymultimodal evaluationscientific figure understanding
0
0 comments X

The pith

Vision-language models reach at most 0.5888 Jaccard score when recovering Miller indices from XRD images plus CIF data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper creates a benchmark to test whether vision-language models can extract peak locations from rendered powder XRD curves and then carry out the crystallographic reasoning to assign the correct set of Miller indices. It evaluates seven models on 250 samples that pair each image with its source CIF file and formula, separating visual reading errors from reasoning errors. The strongest result is a Jaccard score of 0.5888 and an exact-match rate of 37.6 percent, with six models staying below 0.5 Jaccard; performance drops sharply on double-peak patterns and gains little from the provided CIF text. A reader would care because reliable automated indexing would speed materials identification, yet the results show current models still cannot handle this quantitative scientific figure task at usable accuracy.

Core claim

CrystalXRD-Bench supplies each rendered XRD image together with its originating CIF text and chemical formula so visual extraction can be measured separately from crystallographic calculation; on the single task of recovering every HKL that contributes to the highest-intensity peak, the best model reaches a Jaccard score of 0.5888 with 37.6 percent exact matches while six of seven models remain below 0.50, and error patterns reveal systematic weaknesses on double-peak cases, over-prediction by recall-heavy models, and limited help from CIF access.

What carries the argument

CrystalXRD-Bench, the 250-sample test set that requires recovering the complete set of HKLs for the highest-intensity peak from a rendered XRD image paired with CIF text and formula.

If this is right

  • Double-peak XRD patterns remain especially difficult for current models.
  • Recall-oriented models increase coverage mainly by over-predicting HKLs rather than improving extraction accuracy.
  • Access to the source CIF text does not close the gap in performing the required crystallographic calculations.
  • The benchmark isolates concrete conditions under which vision-language models fail on quantitative scientific figures.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Specialized visual processing for curve and peak data may be required before these models can support routine crystallography workflows.
  • Extending the same evaluation protocol to full-pattern indexing or to weaker peaks would likely reveal additional failure modes.
  • Hybrid systems that combine vision-language output with external symbolic crystallography routines could be tested directly on this benchmark.

Load-bearing premise

The 250 samples drawn from ten public crystallographic databases form a representative test without unaccounted confounding factors from peak rendering or database choice.

What would settle it

A model that reaches Jaccard scores above 0.8 on the full set of 250 samples, or a controlled study showing that rendering choices alone produce the observed error rates.

Figures

Figures reproduced from arXiv: 2605.29446 by Beng Wang, Bing Zhao, Chengliang Xu, Hu Wei, Peiyao Xiao, Xiaogang Li.

Figure 1
Figure 1. Figure 1: Pipeline of CrystalXRD-Bench. Starting from CIF crystal structures from 10 databases, we [PITH_FULL_IMAGE:figures/full_fig_p005_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Dataset composition of CrystalXRD-Bench: (a) difficulty distribution (union size) per database, [PITH_FULL_IMAGE:figures/full_fig_p006_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Overall performance across four base metrics. Horizontal dashed lines mark tier boundaries (T1: [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Precision-recall scatter with iso-F1 curves. Point size encodes average predicted set size. GPT-5.4 [PITH_FULL_IMAGE:figures/full_fig_p009_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Effect of over-prediction penalty: dumbbell chart comparing Jaccard (left) and [PITH_FULL_IMAGE:figures/full_fig_p010_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Per-dataset Jaccard heatmap (7 models × 10 databases). Gold borders mark each row’s maximum. GPT-5.4 is best on 9 of 10 databases; only QMOF favors Gemini 3.1. The MOF datasets (QMOF, hMOF) tend to yield higher scores, while GNoME remains difficult across models. crystal-system composition within each union-size group, an analysis we leave to future work. This non-monotonicity, in turn, illustrates the bro… view at source ↗
Figure 7
Figure 7. Figure 7: Performance decomposition: (a) Jaccard vs. difficulty level (union size). GPT-5.4 leads all three [PITH_FULL_IMAGE:figures/full_fig_p012_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Representative XRD patterns from CrystalXRD-Bench, ordered from easier to harder cases. [PITH_FULL_IMAGE:figures/full_fig_p013_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Radar charts for all models over six dimensions: Jaccard, Precision, Recall, Exact Match Rate, [PITH_FULL_IMAGE:figures/full_fig_p017_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Top-two model comparison (GPT-5.4 vs. Gemini 3.1 Pro): (a) waterfall chart of per-metric [PITH_FULL_IMAGE:figures/full_fig_p017_10.png] view at source ↗
read the original abstract

Miller-index identification from powder XRD patterns requires capabilities untested by existing multimodal benchmarks: the model must read a narrow peak location from a rendered scientific curve and then connect that observation to multi-step crystallographic reasoning. We introduce CrystalXRD-Bench, a 250-sample benchmark built from 10 public crystallographic databases for a single task: recover the full set of HKLs contributing to the highest-intensity peak in an XRD pattern. Each sample pairs the rendered XRD image with the source CIF text and chemical formula, so visual extraction errors and reasoning errors can be examined side by side. We evaluate seven vision-language models. The best Jaccard score is 0.5888 (GPT-5.4) with an exact-match rate of 37.6%, yet six of seven models remain below Jaccard 0.50; the task is far from solved. Error patterns vary systematically: double-peak cases are especially brittle, recall-heavy models gain coverage by over-predicting HKLs, and access to CIF text does not close the gap in crystallographic calculation. Alongside model rankings, the benchmark identifies the conditions under which current VLMs fail on quantitative scientific figures. All data and evaluation code will be publicly available.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces CrystalXRD-Bench, a 250-sample benchmark drawn from 10 public crystallographic databases. Each sample pairs a rendered XRD image with the source CIF text and chemical formula; the single task is to recover the full set of HKLs contributing to the highest-intensity peak. Seven vision-language models are evaluated on Jaccard overlap and exact-match rate. The best result is Jaccard 0.5888 and 37.6% exact match (GPT-5.4); six of seven models remain below Jaccard 0.50. The authors conclude that the task is far from solved, document systematic error patterns (double-peak brittleness, recall-heavy over-prediction, limited benefit from CIF text), and state that all data and evaluation code will be released publicly.

Significance. If the benchmark construction is shown to be free of rendering artifacts, the work supplies a concrete, reproducible test of visual quantitative-figure reading plus multi-step crystallographic reasoning that is absent from existing multimodal benchmarks. The planned public release of data and code is a clear strength that would allow direct follow-up and extension.

major comments (2)
  1. [Abstract / Dataset Construction] Abstract and (presumed) §3 Dataset Construction: the description of the rendered XRD images supplies no parameters for peak profile (Gaussian/Lorentzian), width, noise level, axis scaling, or intensity normalization. Because the central claim is that low Jaccard scores reflect genuine limits in visual peak extraction plus HKL reasoning, the absence of these details is load-bearing; idealized delta-function peaks or fixed widths could create failure modes absent from experimental data.
  2. [Abstract / Dataset Construction] Abstract and (presumed) §3: the 250 samples are drawn from 10 databases, yet no breakdown by crystal system, space-group complexity, or peak-overlap statistics is provided. Without this distribution, it is impossible to verify that the reported performance gap is not driven by an unrepresentative subset of easy or hard cases.
minor comments (1)
  1. [Abstract] Abstract: the model identifier 'GPT-5.4' should be clarified (version, API date, or whether it is a hypothetical label) so that the ranking can be reproduced.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on dataset construction. We agree that both points require clarification and will revise the manuscript accordingly to strengthen the work.

read point-by-point responses
  1. Referee: [Abstract / Dataset Construction] Abstract and (presumed) §3 Dataset Construction: the description of the rendered XRD images supplies no parameters for peak profile (Gaussian/Lorentzian), width, noise level, axis scaling, or intensity normalization. Because the central claim is that low Jaccard scores reflect genuine limits in visual peak extraction plus HKL reasoning, the absence of these details is load-bearing; idealized delta-function peaks or fixed widths could create failure modes absent from experimental data.

    Authors: We agree that the rendering parameters are essential for interpreting the results and supporting the claim that performance gaps reflect model limitations rather than artifacts. The original manuscript omitted these specifics in §3. In the revision we will add a precise description of the peak profile (Gaussian with specified FWHM), noise level, axis scaling, and intensity normalization, including the exact parameter values used when generating images from the CIF files. revision: yes

  2. Referee: [Abstract / Dataset Construction] Abstract and (presumed) §3: the 250 samples are drawn from 10 databases, yet no breakdown by crystal system, space-group complexity, or peak-overlap statistics is provided. Without this distribution, it is impossible to verify that the reported performance gap is not driven by an unrepresentative subset of easy or hard cases.

    Authors: We acknowledge that a statistical breakdown of the 250 samples is needed to demonstrate representativeness. In the revised manuscript we will include in §3 (or a new table) the distribution by crystal system, space-group complexity metrics (e.g., number of atoms per unit cell or number of possible HKLs), and peak-overlap statistics (e.g., fraction of samples with overlapping peaks near the highest-intensity reflection). This will allow readers to assess whether the performance gap is driven by dataset composition. revision: yes

Circularity Check

0 steps flagged

Empirical benchmark with no derivation chain or self-referential steps

full rationale

The paper introduces CrystalXRD-Bench as a 250-sample dataset drawn from public crystallographic databases and reports direct empirical evaluations of seven vision-language models on the task of HKL indexing from rendered XRD images. No equations, fitted parameters, predictions derived from prior fits, or load-bearing self-citations appear in the abstract or described content. The central claims (Jaccard scores, exact-match rates, error patterns) rest on straightforward model evaluation against the new benchmark rather than any reduction to inputs by construction. This is a standard empirical benchmark paper whose results are externally falsifiable via the released data and code.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no visible free parameters, axioms, or invented entities; benchmark construction details are not provided.

pith-pipeline@v0.9.1-grok · 5758 in / 1115 out tokens · 28960 ms · 2026-06-29T07:25:09.160345+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

29 extracted references · 7 canonical work pages · 1 internal anchor

  1. [1]

    Cullity and S.R

    B.D. Cullity and S.R. Stock.Elements of X-Ray Diffraction. Prentice Hall, 3rd edition, 2001

  2. [2]

    DVQA: Understanding data visualizations via question answering

    Kushal Kafle, Brian Price, Scott Cohen, and Christopher Kanan. DVQA: Understanding data visualizations via question answering. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5648–5656, 2018

  3. [3]

    Khapra, and Pratyush Kumar

    Nitesh Methani, Pritha Ganguly, Mitesh M. Khapra, and Pratyush Kumar. PlotQA: Reasoning over scientific plots. InProceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 1527–1536, 2020

  4. [4]

    ChartQA:Abenchmark for question answering about charts with visual and logical reasoning

    AhmedMasry,DoXuanLong,JiaQingTan,ShafiqJoty,andEnamulHoque. ChartQA:Abenchmark for question answering about charts with visual and logical reasoning. InFindings of the Association for Computational Linguistics: ACL 2022, pages 2263–2279, 2022

  5. [5]

    arXiv preprint arXiv:2312.15915 , year=

    ZhengzhuoXuetal. ChartBench: Abenchmarkforcomplexvisualreasoningincharts.arXivpreprint arXiv:2312.15915, 2023. 13

  6. [6]

    FigureQA: An Annotated Figure Dataset for Visual Reasoning

    Samira Ebrahimi Kahou, Vincent Michalski, Adam Atkinson, Ákos Kádár, Adam Trischler, and Yoshua Bengio. FigureQA: An annotated figure dataset for visual reasoning. InICLR 2018 Workshop on Learning to Reason, 2018. URLhttps://arxiv.org/abs/1710.07300. arXiv:1710.07300

  7. [7]

    MMC: Advancing multimodal chart understanding with large-scale instruction tuning

    Fuxiao Liu, Xiaoyang Wang, Wenlin Yao, Jianshu Chen, Kaiqiang Song, Sangwoo Cho, and Yaser Yacoob. MMC: Advancing multimodal chart understanding with large-scale instruction tuning. InProceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), 2024. URL https://arxiv.org/abs/2311.10774. arXiv:...

  8. [8]

    ChartX & ChartVLM: A versatile benchmark and foundation model for complicated chart reasoning

    Renqiu Xia, Bo Zhang, Hancheng Ye, Xiangchao Yan, Qi Liu, Hongbin Zhou, Zijun Chen, Min Shi, Tao Yan, Junchi Jiang, et al. ChartX & ChartVLM: A versatile benchmark and foundation model for complicated chart reasoning. InProceedings of the 32nd ACM International Conference on Multimedia (MM), 2024. URLhttps://arxiv.org/abs/2402.12185. arXiv:2402.12185

  9. [9]

    LLM4Mat- Bench: Benchmarking large language models for materials property knowledge

    AndreNiyongaboRubungo,KyleSwanson,TianXie,BoruiPeng,andJeffreyC.Grossman. LLM4Mat- Bench: Benchmarking large language models for materials property knowledge. InNeurIPS 2024 Workshop on AI for Accelerated Materials Design (AI4Mat), 2024

  10. [10]

    MatSci-NLP: Evaluating scientific language models on materials science language tasks using text-to-schema modeling

    Yu Song, Santiago Miret, and Bang Liu. MatSci-NLP: Evaluating scientific language models on materials science language tasks using text-to-schema modeling. InProceedings of the 61st Annual Meeting of the Association for Computational Linguistics (ACL), pages 3621–3639, 2023. URL https://arxiv.org/abs/2305.08264. arXiv:2305.08264

  11. [11]

    Classification of crystal structure using a convolutional neural network.IUCrJ, 4(4):486–494, 2017

    Won Bae Park, Jiyong Chung, Jaeyoung Jung, Keemin Sohn, Satendra Pal Singh, Myoungho Pyo, Namsoo Shin, and Kee-Sun Sohn. Classification of crystal structure using a convolutional neural network.IUCrJ, 4(4):486–494, 2017

  12. [12]

    A deep-learning technique for phase identification in multiphase inorganic compounds using synthetic XRD powder patterns.Nature Communications, 11:86, 2020

    Jong-Won Lee, Won Bae Park, Jin Hee Lee, Satendra Pal Singh, and Kee-Sun Sohn. A deep-learning technique for phase identification in multiphase inorganic compounds using synthetic XRD powder patterns.Nature Communications, 11:86, 2020

  13. [13]

    Rosen, Shaelyn M

    Andrew S. Rosen, Shaelyn M. Iyer, Debmalya Ray, Zhenpeng Yao, Alán Aspuru-Guzik, Laura Gagliardi, Justin M. Notestein, and Randall Q. Snurr. Machine learning the quantum-chemical properties of metal–organic frameworks for accelerated materials discovery.Matter, 4(5):1578–1597, 2021

  14. [14]

    Wilmer, Michael Leaf, Crystal Y

    Christopher E. Wilmer, Michael Leaf, Crystal Y. Lee, Omar K. Farha, Brad G. Hauser, Joseph T. Hupp,andRandallQ.Snurr. Large-scalescreeningofhypotheticalmetal–organicframeworks.Nature Chemistry, 4:83–89, 2012

  15. [15]

    Borysov, R

    Stanislav S. Borysov, R. Matthias Geilhufe, and Alexander V. Balatsky. Organic materials database: An open-access online database for data mining.PLoS ONE, 12(2):e0171501, 2017

  16. [16]

    Database of Cantor high-entropy alloys.Scientific Data, 2024

    Gowoon Cheon et al. Database of Cantor high-entropy alloys.Scientific Data, 2024. Cantor High-Entropy Alloy Database

  17. [17]

    Garrity, Andrew C.E

    Kamal Choudhary, Kevin F. Garrity, Andrew C.E. Reid, Brian DeCost, Adam J. Biacchi, Angela R. Hight Walker, Francesca Tavazza, Ichiro Takeuchi, Alden A. Dima, and Jason R. Hattrick-Simpers. The joint automated repository for various integrated simulations (JARVIS) for data-driven materials design.npj Computational Materials, 6:173, 2020

  18. [18]

    Anubhav Jain, Shyue Ping Ong, Geoffroy Hautier, Wei Chen, William Davidson Richards, Stephen Dacek, Shreyas Cholia, Dan Gunter, David Skinner, Gerbrand Ceder, and Kristin A. Persson. Commentary: The Materials Project: A materials genome approach to accelerating materials innovation.APL Materials, 1(1):011002, 2013

  19. [19]

    SNUMat: A large materials database for machine learning.npj Computational Materials, 2023

    Jaehyeok Lee et al. SNUMat: A large materials database for machine learning.npj Computational Materials, 2023. 14

  20. [20]

    Saal, Scott Kirklin, Muratahan Aykol, Bryce Meredig, and C

    James E. Saal, Scott Kirklin, Muratahan Aykol, Bryce Meredig, and C. Wolverton. Materials design anddiscoverywithhigh-throughputdensityfunctionaltheory: TheOpenQuantumMaterialsDatabase (OQMD).JOM, 65:1501–1509, 2013

  21. [21]

    Schoenholz, Muratahan Aykol, Gowoon Cheon, and Ekin Dogus Cubuk

    Amil Merchant, Simon Batzner, Samuel S. Schoenholz, Muratahan Aykol, Gowoon Cheon, and Ekin Dogus Cubuk. Scaling deep learning for materials discovery.Nature, 624:80–85, 2023

  22. [22]

    Chevrier, Kristin A

    Shyue Ping Ong, William Davidson Richards, Anubhav Jain, Geoffroy Hautier, Michael Kocher, Shreyas Cholia, Dan Gunter, Vincent L. Chevrier, Kristin A. Persson, and Gerbrand Ceder. Python Materials Genomics (pymatgen): A robust, open-source Python library for materials analysis. Computational Materials Science, 68:314–319, 2013

  23. [23]

    Thompson, D.E

    P. Thompson, D.E. Cox, and J.B. Hastings. Rietveld refinement of Debye–Scherrer synchrotron X-ray data from Al2O3.Journal of Applied Crystallography, 20:79–83, 1987

  24. [24]

    The distribution of the flora in the alpine zone.New Phytologist, 11(2):37–50, 1912

    Paul Jaccard. The distribution of the flora in the alpine zone.New Phytologist, 11(2):37–50, 1912

  25. [25]

    A review on multi-label learning algorithms.IEEE Transactions on Knowledge and Data Engineering, 26(8):1819–1837, 2014

    Min-Ling Zhang and Zhi-Hua Zhou. A review on multi-label learning algorithms.IEEE Transactions on Knowledge and Data Engineering, 26(8):1819–1837, 2014

  26. [26]

    Gražulis, D

    SauliusGražulis,DanielChateigner,RobertT.Downs,A.F.T.Yokochi,MiguelQuirós,LucaLutterotti, Elena Manakova, Justas Butkus, Peter Moeck, and Armel Le Bail. Crystallography open database - an open-accesscollectionofcrystalstructures.JournalofAppliedCrystallography,42(4):726–729,2009. doi: 10.1107/S0021889809016690. URLhttps://doi.org/10.1107/S0021889809016690...

  27. [27]

    An XRD pattern image (intensity vs. 2theta)

  28. [28]

    The CIF of the material: <CIF_TEXT>

  29. [29]

    max_peak_hkls

    The chemical formula: <FORMULA> Task: Identify ALL Miller indices (hkl) for the HIGHEST peak. It may result from superposition of multiple crystallographic planes. Return JSON: {"max_peak_hkls": [[h,k,l], ...]} For hexagonal and trigonal systems, the prompt switches to 4-index Miller-Bravais notation[hkil] with the constrainti=−(h+k). B Detailed Per-Datas...