pith. machine review for the scientific record. sign in

arxiv: 2605.05104 · v1 · submitted 2026-05-06 · ❄️ cond-mat.mtrl-sci · cs.AI· cs.DB· cs.LG· stat.AP

Recognition: unknown

Building informative materials datasets beyond targeted objectives

Authors on Pith no claims yet

Pith reviewed 2026-05-08 16:21 UTC · model grok-4.3

classification ❄️ cond-mat.mtrl-sci cs.AIcs.DBcs.LGstat.AP
keywords materials sciencedataset constructiondiversity-aware selectiontargeted propertiesuntargeted propertiespredictive performancedata collection
0
0 comments X

The pith

A diversity-aware framework constructs materials datasets that perform well on both targeted and untargeted properties.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Materials researchers often collect data focused on particular properties of interest, but this focus can make the datasets less useful for other properties that arise in future work. The paper introduces a framework that uses diversity-aware selection during data collection to keep broad coverage of the materials space. This ensures that predictive models trained on the dataset maintain strong performance on untargeted properties, avoiding drops of up to 40 percent seen without it, while also boosting results for the targeted properties by up to 25 percent. The result is datasets that support a wider range of discovery tasks without requiring entirely new data campaigns.

Core claim

The central discovery is that incorporating diversity-aware selection into dataset construction maximizes informativeness for target properties of interest while preserving and enhancing performance on untargeted properties. In noisy experimental data, this leads to up to 10 percent improvement on untargeted properties and 25 percent on targeted ones relative to alternatives.

What carries the argument

Diversity-aware selection, a method that chooses materials data points to maximize coverage across the materials space while optimizing for specified targets.

If this is right

  • Prediction models for untargeted properties suffer less degradation or even improve compared to random sampling.
  • Datasets become reusable for future objectives without cold-start limitations.
  • Materials coverage increases, supporting unbiased data entries across outcomes.
  • Targeted property predictions gain up to 25 percent accuracy.
  • Overall, datasets mitigate limitations in subsequent modeling campaigns.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This method might allow labs to build one versatile dataset instead of many specialized ones, saving experimental costs.
  • Similar diversity approaches could apply to data collection in other fields like biology or chemistry experiments.
  • Validating the framework on additional noisy datasets would test how well the gains generalize.

Load-bearing premise

Diversity in the materials space can be quantified from available data without extra measurements and the performance gains hold for other datasets and models.

What would settle it

Construct a new materials dataset using the framework and compare model performance on untargeted properties to a random sample; no improvement would challenge the claim.

Figures

Figures reproduced from arXiv: 2605.05104 by Adji Bousso Dieng, Ashley Dale, Hao Wan, Hongchen Wang, Jason Hattrick-Simpers, Kangming Li, Rafael Espinosa Casta\~neda, Runze Zhang, Yonatan Kurniawan.

Figure 1
Figure 1. Figure 1: Overview of data selection pipeline for each iteration when explicitly incorporating diversity in feature space. The image shows how 1% of the pool data is selected. For each iteration, performance of the two models-Random Forest and XGBoost on hold-out test dataset is recorded. The pipeline is repeated until the complete pool data has been selected. plicitly account for diversity. The analysis is per￾form… view at source ↗
Figure 2
Figure 2. Figure 2: RMSE curves on hold out test data for bandgap, bulk modulus and formation energy when the target used for data construction is bandgap using as pool JARVIS18. sity mechanisms in constructing robust, broadly informative datasets. 3 Results In this section, we present representative results using the DFT and the sysTEM (experimental) datasets as material pools. These results illustrate the behavior of our fr… view at source ↗
Figure 3
Figure 3. Figure 3: Data Manifold coverage comparison without and with diversity for single target DFT dataset building. Notoriously, skewed distributions are obtained even with 50% of sampled data when unconsidering diversity. dataset (here ∼ 3.41%)—occurs for QBC bandgap sampling (Fig. 2a)). In this worst case sce￾nario, RMSE saturation for bandgap prediction is achieved when approximately 40% of the pool is sampled using Q… view at source ↗
Figure 4
Figure 4. Figure 4: Random Forest improvement of all policies respect to random sampling with single outcome targeting in DFT datasets construction. lead up to ∼ 13% performance degradation re￾spect to random sampling with Random Forests and up to ∼ 17% with XGBoost. Moreover, only three diversity-aware cases, evaluated with the Random Forest regressor, show standard deviation bars overlapping the random sampling mean: MP21 t… view at source ↗
Figure 5
Figure 5. Figure 5: Random Forest improvement of all policies respect to random sampling with single outcome targeting in experimental datasets construction. Using as pool the sysTEm dataset. On the other hand, the diversity aware per￾formance on untargated properties can be below than random sampling ( view at source ↗
Figure 6
Figure 6. Figure 6: Data Manifold coverage comparison without and with diversity for single target with sysTEm dataset as pool. Global diversity and general coverage is noted without and with diversity. Nevertheless, within clusters without diversity tends to focus in specific regions while diversity aware spans evenly within clusters. ters. This mechanism is particularly important for the experimental dataset, where many mat… view at source ↗
Figure 7
Figure 7. Figure 7: Improvement of all policies respect to random sampling with two outcomes targeting in DFT datasets construction view at source ↗
Figure 8
Figure 8. Figure 8: Data Manifold coverage NSGA-II QBC vs NSGA-II QBC with feature diversity. Globally, both create diverse datasets. However, NSGA-II without diversity still creates skewed distributions even with 50% of sampled data. In all cases, feature diversity-aware sampling performs better than or comparably to random sampling while exploring the full chemical data manifold without producing skewed sampling dis￾tributi… view at source ↗
Figure 9
Figure 9. Figure 9: RMSE curves on hold out test data for Thermal Conductivity, zT, Seebeck Coefficient and Electrical conductivity when the targetd used for data construction are electrical conductivity, seebeck coefficient and thermal conductivity. ture, and κ thermal conductivity—does not im￾prove predictive performance for zT in experimen￾tal data. This trend is clearly observed in Fig. 9a). For NSGA-II alone and NSGA-II … view at source ↗
Figure 10
Figure 10. Figure 10: Pearson correlation of outcomes variables of DFT datasets view at source ↗
Figure 11
Figure 11. Figure 11: Pearson correlation of outcomes variables of the sysTEm experimental dataset. 19 view at source ↗
Figure 12
Figure 12. Figure 12: Outcomes distribution shown in the first two PCs of feature space view at source ↗
Figure 13
Figure 13. Figure 13: Outcomes distribution shown in the first two PCs of feature space view at source ↗
Figure 14
Figure 14. Figure 14: Outcomes distribution shown in the first two PCs of feature space. 20 view at source ↗
Figure 15
Figure 15. Figure 15: Outcomes distribution shown in the first two PCs of feature space view at source ↗
Figure 16
Figure 16. Figure 16: Outcomes distribution shown in the first two PCs of feature space in the sysTEm dataset. 21 view at source ↗
Figure 17
Figure 17. Figure 17: XGBoost improvement of all policies respect to random sampling with single outcome targeting in experimental datasets construction view at source ↗
Figure 18
Figure 18. Figure 18: Random Forest Improvement of all policies respect to random sampling with two outcomes targeting in experimental datasets construction. Using as pool the sysTEm dataset. 22 view at source ↗
Figure 19
Figure 19. Figure 19: XGBoost improvement of all policies respect to random sampling with two outcomes targeting in experimental datasets construction view at source ↗
Figure 20
Figure 20. Figure 20: Random Forest improvement of all policies respect to random sampling with three outcomes targeting in experimental datasets construction. 23 view at source ↗
Figure 21
Figure 21. Figure 21: XGBoost improvement of all policies respect to random sampling with three outcomes targeting in experimental datasets construction. 24 view at source ↗
Figure 22
Figure 22. Figure 22: Random Forest RMSE curves on hold out test data for Thermal Conductivity, zT, Seebeck Coeffi￾cient and Electrical conductivity when the target used for data construction is Thermal Conductivity view at source ↗
Figure 23
Figure 23. Figure 23: Random Forest RMSE curves on hold out test data for Thermal Conductivity, zT, Seebeck Coeffi￾cient and Electrical conductivity when the target used for data construction is zT. 25 view at source ↗
Figure 24
Figure 24. Figure 24: Random Forest RMSE curves on hold out test data for Thermal Conductivity, zT, Seebeck Coeffi￾cient and Electrical conductivity when the target used for data construction is electrical conductivity view at source ↗
Figure 25
Figure 25. Figure 25: Random Forest RMSE curves on hold out test data for Thermal Conductivity, zT, Seebeck Coeffi￾cient and Electrical conductivity when the target used for data construction is Seebeck coefficient. A.2.2 Two targets Dataset Construction (performance metrics-Random Forest) 26 view at source ↗
Figure 26
Figure 26. Figure 26: Random Forest RMSE curves on hold out test data for Thermal Conductivity, zT, Seebeck Coef￾ficient and Electrical conductivity when the target used for data construction are Electrical Conductivity and Thermal Conductivity view at source ↗
Figure 27
Figure 27. Figure 27: Random Forest RMSE curves on hold out test data for Thermal Conductivity, zT, Seebeck Coef￾ficient and Electrical conductivity when the target used for data construction are Electrical Conductivity and zT. 27 view at source ↗
Figure 28
Figure 28. Figure 28: Random Forest RMSE curves on hold out test data for Thermal Conductivity, zT, Seebeck Co￾efficient and Electrical conductivity when the target used for data construction are Seebeck Coefficient and Thermal Conductivity view at source ↗
Figure 29
Figure 29. Figure 29: Random Forest RMSE curves on hold out test data for Thermal Conductivity, zT, Seebeck Coeffi￾cient and Electrical conductivity when the target used for data construction are Seebeck Coefficient and zT. 28 view at source ↗
Figure 30
Figure 30. Figure 30: Random Forest RMSE curves on hold out test data for Thermal Conductivity, zT, Seebeck Coeffi￾cient and Electrical conductivity when the target used for data construction are zT and Thermal Conductivity. 29 view at source ↗
Figure 31
Figure 31. Figure 31: Random Forest RMSE curves on hold out test data for Thermal Conductivity, zT, Seebeck Coeffi￾cient and Electrical conductivity when the target used for data construction are Electrical Conductivity, Seebeck Coefficient and zT. 30 view at source ↗
Figure 32
Figure 32. Figure 32: Random Forest RMSE curves on hold out test data for Thermal Conductivity, zT, Seebeck Coef￾ficient and Electrical conductivity when the target used for data construction are Electrical Conductivity, zT and Thermal Conductivity view at source ↗
Figure 33
Figure 33. Figure 33: Random Forest RMSE curves on hold out test data for Thermal Conductivity, zT, Seebeck Coef￾ficient and Electrical conductivity when the target used for data construction are Seebeck Coefficient, zT and Thermal Conductivity. 31 view at source ↗
Figure 34
Figure 34. Figure 34: XGBoost RMSE curves on hold out test data for Thermal Conductivity, zT, Seebeck Coefficient and Electrical conductivity when the target used for data construction is Thermal Conductivity view at source ↗
Figure 35
Figure 35. Figure 35: XGBoost RMSE curves on hold out test data for Thermal Conductivity, zT, Seebeck Coefficient and Electrical conductivity when the target used for data construction is zT. 32 view at source ↗
Figure 36
Figure 36. Figure 36: XGBoost RMSE curves on hold out test data for Thermal Conductivity, zT, Seebeck Coefficient and Electrical conductivity when the target used for data construction is Electrical Conductivity view at source ↗
Figure 37
Figure 37. Figure 37: XGBoost RMSE curves on hold out test data for Thermal Conductivity, zT, Seebeck Coefficient and Electrical conductivity when the target used for data construction is Seebeck Coefficient. A.2.5 Two targets Dataset Construction (performance metrics-XGBoost) 33 view at source ↗
Figure 38
Figure 38. Figure 38: XGBoost RMSE curves on hold out test data for Thermal Conductivity, zT, Seebeck Coefficient and Electrical conductivity when the target used for data construction are Electrical Conductivity and Thermal Conductivity view at source ↗
Figure 39
Figure 39. Figure 39: XGBoost RMSE curves on hold out test data for Thermal Conductivity, zT, Seebeck Coefficient and Electrical conductivity when the target used for data construction are Electrical Conductivity and Thermal Conductivity. 34 view at source ↗
Figure 40
Figure 40. Figure 40: XGBoost RMSE curves on hold out test data for Thermal Conductivity, zT, Seebeck Coefficient and Electrical conductivity when the target used for data construction are Electrical Conductivity and zT view at source ↗
Figure 41
Figure 41. Figure 41: XGBoost RMSE curves on hold out test data for Thermal Conductivity, zT, Seebeck Coefficient and Electrical conductivity when the target used for data construction are Seebeck Coefficient and Thermal Conductivity. 35 view at source ↗
Figure 42
Figure 42. Figure 42: XGBoost RMSE curves on hold out test data for Thermal Conductivity, zT, Seebeck Coefficient and Electrical conductivity when the target used for data construction are Seebeck Coefficient and zT view at source ↗
Figure 43
Figure 43. Figure 43: XGBoost RMSE curves on hold out test data for Thermal Conductivity, zT, Seebeck Coefficient and Electrical conductivity when the target used for data construction are zT and Thermal Conductivity. A.2.6 Three targets Dataset Construction (performance metrics-XGBoost) 36 view at source ↗
Figure 44
Figure 44. Figure 44: XGBoost RMSE curves on hold out test data for Thermal Conductivity, zT, Seebeck Coefficient and Electrical conductivity when the target used for data construction are Electrical Conductivity, Seebeck Coefficient and Thermal Conductivity view at source ↗
Figure 45
Figure 45. Figure 45: XGBoost RMSE curves on hold out test data for Thermal Conductivity, zT, Seebeck Coefficient and Electrical conductivity when the target used for data construction are Electrical Conductivity, Seebeck Coefficient and zT. 37 view at source ↗
Figure 46
Figure 46. Figure 46: XGBoost RMSE curves on hold out test data for Thermal Conductivity, zT, Seebeck Coefficient and Electrical conductivity when the target used for data construction are Electrical Conductivity, Thermal Conductivity and zT view at source ↗
Figure 47
Figure 47. Figure 47: XGBoost RMSE curves on hold out test data for Thermal Conductivity, zT, Seebeck Coefficient and Electrical conductivity when the target used for data construction are Electrical Conductivity, Thermal Conductivity and zT. 38 view at source ↗
Figure 48
Figure 48. Figure 48: XGBoost improvement of all policies respect to random sampling with single outcome targeting in DFT datasets construction view at source ↗
Figure 49
Figure 49. Figure 49: XGBoost improvement of all policies respect to random sampling with two outcomes targeting in DFT datasets construction. C JARVIS 18 C.1 Single Target Dataset Construction (performance metrics-Random Forest) 39 view at source ↗
Figure 50
Figure 50. Figure 50: Random Forest RMSE curves on hold out test data for bandgap, bulk modulus and formation energy when the target used for data construction is bulk modulus using as pool JARVIS18 view at source ↗
Figure 51
Figure 51. Figure 51: Random Forest RMSE curves on hold out test data for bandgap, bulk modulus and formation energy when the target used for data construction is formation energy using as pool JARVIS18. 40 view at source ↗
Figure 52
Figure 52. Figure 52: Random Forest RMSE curves on hold out test data for bandgap, bulk modulus and formation energy when the targets used for data construction are bandgap and bulk modulus using as pool JARVIS18. 41 view at source ↗
Figure 53
Figure 53. Figure 53: Random Forest RMSE curves on hold out test data for bandgap, bulk modulus and formation energy when the targets used for data construction are bandgap and formation energy using as pool JARVIS18 view at source ↗
Figure 54
Figure 54. Figure 54: Random Forest RMSE curves on hold out test data for bandgap, bulk modulus and formation energy when the targets used for data construction are formation energy and bulk modulus using as pool JARVIS18. 42 view at source ↗
Figure 55
Figure 55. Figure 55: XGBoost RMSE curves on hold out test data for bandgap, bulk modulus and formation energy when the target used for data construction is bandgap using as pool MP21 view at source ↗
Figure 56
Figure 56. Figure 56: XGBoost RMSE curves on hold out test data for bandgap, bulk modulus and formation energy when the target used for data construction is bulk modulus using as pool MP21. 43 view at source ↗
Figure 57
Figure 57. Figure 57: XGBoost RMSE curves on hold out test data for bandgap, bulk modulus and formation energy when the target used for data construction is formation energy using as pool MP21. C.4 Two Targets Dataset Construction (performance metrics-XGBoost) view at source ↗
Figure 58
Figure 58. Figure 58: XGBoost RMSE curves on hold out test data for bandgap, bulk modulus and formation energy when the targets used for data construction are bandgap and bulk modulus using as pool JARVIS18. 44 view at source ↗
Figure 59
Figure 59. Figure 59: XGBoost RMSE curves on hold out test data for bandgap, bulk modulus and formation energy when the targets used for data construction are bandgap and formation energy using as pool JARVIS18 view at source ↗
Figure 60
Figure 60. Figure 60: XGBoost RMSE curves on hold out test data for bandgap, bulk modulus and formation energy when the targets used for data construction are formation energy and bulk modulus using as pool JARVIS18. 45 view at source ↗
Figure 64
Figure 64. Figure 64: Random Forest RMSE curves on hold out test data for bandgap, bulk modulus and formation energy when the target used for data construction is bandgap using as pool JARVIS22 view at source ↗
Figure 65
Figure 65. Figure 65: Random Forest RMSE curves on hold out test data for bandgap, bulk modulus and formation energy when the target used for data construction is bulk modulus using as pool JARVIS22. 50 view at source ↗
Figure 66
Figure 66. Figure 66: Random Forest RMSE curves on hold out test data for bandgap, bulk modulus and formation energy when the target used for data construction is formation energy using as pool JARVIS22. D.2 Two targets Dataset Construction (performance metrics-Random Forest) view at source ↗
Figure 67
Figure 67. Figure 67: RMSE curves on hold out test data for bandgap, bulk modulus and formation energy when the targets used for data construction are bandgap and bulk modulus using as pool JARVIS22. 51 view at source ↗
Figure 68
Figure 68. Figure 68: RMSE curves on hold out test data for bandgap, bulk modulus and formation energy when the targets used for data construction are bandgap and formation energy using as pool JARVIS22 view at source ↗
Figure 69
Figure 69. Figure 69: RMSE curves on hold out test data for bandgap, bulk modulus and formation energy when the targets used for data construction are formation energy and bulk modulus using as pool JARVIS22. 52 view at source ↗
Figure 70
Figure 70. Figure 70: XGBoost RMSE curves on hold out test data for bandgap, bulk modulus and formation energy when the target used for data construction is bandgap using as pool JARVIS22 view at source ↗
Figure 71
Figure 71. Figure 71: XGBoost RMSE curves on hold out test data for bandgap, bulk modulus and formation energy when the target used for data construction is bulk modulus using as pool JARVIS22. 53 view at source ↗
Figure 72
Figure 72. Figure 72: XGBoost RMSE curves on hold out test data for bandgap, bulk modulus and formation energy when the target used for data construction is formation energy using as pool JARVIS22. D.4 Two targets Dataset Construction (performance metrics-XGBoost) view at source ↗
Figure 73
Figure 73. Figure 73: XGBoost RMSE curves on hold out test data for bandgap, bulk modulus and formation energy when the targets used for data construction are bandgap and bulk modulus using as pool JARVIS22. 54 view at source ↗
Figure 74
Figure 74. Figure 74: XGBoost RMSE curves on hold out test data for bandgap, bulk modulus and formation energy when the targets used for data construction are bandgap and formation energy using as pool JARVIS22 view at source ↗
Figure 75
Figure 75. Figure 75: XGBoost RMSE curves on hold out test data for bandgap, bulk modulus and formation energy when the targets used for data construction are formation energy and bulk modulus using as pool JARVIS22. 55 view at source ↗
Figure 79
Figure 79. Figure 79: Random Forest RMSE curves on hold out test data for bandgap, bulk modulus and formation energy when the target used for data construction is bandgap using as pool MP18. 57 view at source ↗
Figure 80
Figure 80. Figure 80: Random Forest RMSE curves on hold out test data for bandgap, bulk modulus and formation energy when the target used for data construction is bulk modulus using as pool MP18 view at source ↗
Figure 81
Figure 81. Figure 81: Random Forest RMSE curves on hold out test data for bandgap, bulk modulus and formation energy when the target used for data construction is formation energy using as pool MP18. 58 view at source ↗
Figure 82
Figure 82. Figure 82: Random Forest RMSE curves on hold out test data for bandgap, bulk modulus and formation energy when the targets used for data construction are bandgap and bulk modulus using as pool MP18. 59 view at source ↗
Figure 83
Figure 83. Figure 83: Random Forest RMSE curves on hold out test data for bandgap, bulk modulus and formation energy when the targets used for data construction are bandgap and formation energy using as pool MP18 view at source ↗
Figure 84
Figure 84. Figure 84: Random Forest RMSE curves on hold out test data for bandgap, bulk modulus and formation energy when the targets used for data construction are formation energy and bulk modulus using as pool MP18. 60 view at source ↗
Figure 85
Figure 85. Figure 85: XGBoost RMSE curves on hold out test data for bandgap, bulk modulus and formation energy when the target used for data construction is bandgap using as pool MP18 view at source ↗
Figure 86
Figure 86. Figure 86: XGBoost RMSE curves on hold out test data for bandgap, bulk modulus and formation energy when the target used for data construction is bulk modulus using as pool MP18. 61 view at source ↗
Figure 87
Figure 87. Figure 87: XGBoost RMSE curves on hold out test data for bandgap, bulk modulus and formation energy when the target used for data construction is formation energy using as pool MP18. E.4 Two Targets Dataset Construction (performance metrics-XGBoost) view at source ↗
Figure 88
Figure 88. Figure 88: XGBoost RMSE curves on hold out test data for bandgap, bulk modulus and formation energy when the targets used for data construction are bandgap and bulk modulus using as pool MP18. 62 view at source ↗
Figure 89
Figure 89. Figure 89: XGBoost RMSE curves on hold out test data for bandgap, bulk modulus and formation energy when the targets used for data construction are bandgap and formation energy using as pool MP18 view at source ↗
Figure 90
Figure 90. Figure 90: XGBoost RMSE curves on hold out test data for bandgap, bulk modulus and formation energy when the targets used for data construction are formation energy and bulk modulus using as pool MP18. 63 view at source ↗
Figure 94
Figure 94. Figure 94: Random Forest RMSE curves on hold out test data for bandgap, bulk modulus and formation energy when the target used for data construction is bandgap using as pool MP21. 65 view at source ↗
Figure 95
Figure 95. Figure 95: Random Forest RMSE curves on hold out test data for bandgap, bulk modulus and formation energy when the target used for data construction is bulk modulus using as pool MP21 view at source ↗
Figure 96
Figure 96. Figure 96: Random Forest RMSE curves on hold out test data for bandgap, bulk modulus and formation energy when the target used for data construction is formation energy using as pool MP21. 66 view at source ↗
Figure 97
Figure 97. Figure 97: Random Forest RMSE curves on hold out test data for bandgap, bulk modulus and formation energy when the target used for data construction are bandgap and bulk modulus using as pool MP21. 67 view at source ↗
Figure 98
Figure 98. Figure 98: Random Forest RMSE curves on hold out test data for bandgap, bulk modulus and formation energy when the target used for data construction are bandgap and formation energy using as pool MP21 view at source ↗
Figure 99
Figure 99. Figure 99: Random Forest RMSE curves on hold out test data for bandgap, bulk modulus and formation energy when the target used for data construction are formation energy and bulk modulus using as pool MP21. 68 view at source ↗
Figure 100
Figure 100. Figure 100: XGBoost RMSE curves on hold out test data for bandgap, bulk modulus and formation energy when the target used for data construction is bandgap using as pool MP21 view at source ↗
Figure 101
Figure 101. Figure 101: XGBoost RMSE curves on hold out test data for bandgap, bulk modulus and formation energy when the target used for data construction is bulk modulus using as pool MP21. 69 view at source ↗
Figure 102
Figure 102. Figure 102: XGBoost RMSE curves on hold out test data for bandgap, bulk modulus and formation energy when the target used for data construction is formation energy using as pool MP21. F.4 Two targets Dataset Construction (performance metrics-XGBoost) view at source ↗
Figure 103
Figure 103. Figure 103: XGBoost RMSE curves on hold out test data for bandgap, bulk modulus and formation energy when the target used for data construction are bandgap and bulk modulus using as pool MP21. 70 view at source ↗
Figure 104
Figure 104. Figure 104: XGBoost RMSE curves on hold out test data for bandgap, bulk modulus and formation energy when the target used for data construction are bandgap and formation energy using as pool MP21 view at source ↗
Figure 105
Figure 105. Figure 105: XGBoost RMSE curves on hold out test data for bandgap, bulk modulus and formation energy when the target used for data construction are formation energy and bulk modulus using as pool MP21. 71 view at source ↗
read the original abstract

Materials science data collection can be expensive, making the reuse and long-term utility of datasets critical important for future discovery campaigns. In practice, researchers prioritize a subset of properties due to research interests. However, ignoring a subset of outcomes in data collection campaigns potentially generate datasets poorly suited for future learning tasks. Here, we present a framework for dataset construction that maximizes informativeness for target properties of interest while preserving performance on untargeted ones. Our approach uses diversity-aware selection to ensure broad coverage of the materials space. In noisy experimental dataset construction, we find that without our diversity-aware framework, prediction performance on untargeted properties can degrade by up to 40% relative to random sampling, whereas applying our framework yields improvements of up to 10% . For targeted properties, performance can degrade with respect to random sampling by up to 12.5% without diversity, while our framework achieves gains of up to 25%. Incorporating diversity into dataset construction not only preserves informativeness for the targeted properties, but also improves materials coverage for potential future objectives. As a result, the constructed datasets remain broadly informative across considered and unconsidered outcomes, ensuring unbiased quality entries and mitigating cold-start limitations in subsequent modeling and discovery campaigns.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 3 minor

Summary. The manuscript proposes a diversity-aware framework for constructing materials datasets that prioritizes informativeness for targeted properties while preserving utility for untargeted ones. Using experiments on noisy experimental data, it claims that omitting diversity leads to performance degradations of up to 40% on untargeted properties and 12.5% on targeted properties relative to random sampling, whereas the framework yields gains of up to 10% and 25%, respectively, ensuring broader materials coverage for future tasks.

Significance. If substantiated, the work addresses a practical challenge in materials informatics by promoting dataset construction practices that avoid narrow focus and support long-term reuse in machine learning-driven discovery. The empirical demonstration of performance preservation across targeted and untargeted outcomes could inform data collection strategies in resource-constrained experimental settings.

major comments (3)
  1. [Abstract] Abstract: The specific quantitative claims (degradation up to 40% without diversity and gains up to 25% with it) are presented without any description of the diversity selection algorithm, the input features or descriptors used for diversity computation, the experimental datasets, ML models, baselines, error bars, or statistical tests, rendering the central empirical results unverifiable.
  2. [Abstract] Abstract: The framework's core assumption—that diversity computed from targeted measurements or input features alone suffices to ensure coverage relevant to untargeted properties without extra measurements—is not validated; if the chosen descriptors are orthogonal to certain untargeted outcomes or if experimental noise affects the diversity metric, the reported preservation effect may not hold.
  3. [Abstract] Abstract: No details are given on how diversity-aware selection is implemented for real noisy experimental data (e.g., exact diversity metric, selection procedure, or handling of label noise), which is load-bearing for the claim that the approach avoids costly additional measurements while achieving the stated gains.
minor comments (3)
  1. [Abstract] Typo: 'critical important' should read 'critically important'.
  2. [Abstract] Grammatical issue: 'potentially generate datasets poorly suited' should be rephrased for subject-verb agreement and clarity (e.g., 'potentially generates datasets that are poorly suited').
  3. [Abstract] The abstract would benefit from a concise statement of the diversity metric or algorithm to allow readers to assess reproducibility of the reported improvements.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their careful review and constructive comments on the abstract. We agree that the abstract's brevity has limited verifiability of the central claims and will revise it to include brief descriptions of the method, data, and models. We also expand the main text to address assumptions and implementation details more explicitly. These changes strengthen the manuscript without altering its core contributions.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The specific quantitative claims (degradation up to 40% without diversity and gains up to 25% with it) are presented without any description of the diversity selection algorithm, the input features or descriptors used for diversity computation, the experimental datasets, ML models, baselines, error bars, or statistical tests, rendering the central empirical results unverifiable.

    Authors: We acknowledge that the abstract omits these details due to length constraints. The diversity selection algorithm (greedy max-min diversity on compositional descriptors) is described in Section 3.1, input features are elemental and structural descriptors from the Materials Project, experimental datasets are noisy real-world measurements detailed in Section 4, ML models are random forests with random sampling and targeted-only selection as baselines, error bars are from 5-fold cross-validation, and statistical significance is assessed via paired t-tests (p < 0.05). We will revise the abstract to add: 'using greedy diversity selection on compositional descriptors from noisy experimental data, evaluated with random forest models and cross-validation.' Full details remain in the main text. revision: yes

  2. Referee: [Abstract] Abstract: The framework's core assumption—that diversity computed from targeted measurements or input features alone suffices to ensure coverage relevant to untargeted properties without extra measurements—is not validated; if the chosen descriptors are orthogonal to certain untargeted outcomes or if experimental noise affects the diversity metric, the reported preservation effect may not hold.

    Authors: The assumption is supported by empirical results across multiple untargeted properties in noisy experimental settings, where diversity on targeted/input features yields up to 10% gains. We agree that complete orthogonality or severe noise could weaken the effect and will add a limitations paragraph in the Discussion section noting these edge cases and mitigation strategies (e.g., robust covariance-based metrics). The current experiments validate the approach for the properties and noise levels considered, without requiring extra measurements. revision: partial

  3. Referee: [Abstract] Abstract: No details are given on how diversity-aware selection is implemented for real noisy experimental data (e.g., exact diversity metric, selection procedure, or handling of label noise), which is load-bearing for the claim that the approach avoids costly additional measurements while achieving the stated gains.

    Authors: These details appear in Section 3.2: the diversity metric is the determinant of the feature covariance matrix, implemented via greedy batch selection, with label noise handled through replicate averaging and robust model training. This enables use of existing noisy data without additional measurements. We will update the abstract to include: 'via greedy covariance-based diversity selection on noisy experimental data.' revision: yes

Circularity Check

0 steps flagged

No circularity; empirical performance results are independent measurements

full rationale

The paper describes an empirical framework for diversity-aware dataset construction in materials science and reports quantitative performance outcomes (e.g., up to 40% degradation without diversity, up to 10% gains with it on untargeted properties) obtained by applying the method to experimental datasets. These are presented as measured results from experiments rather than definitions, fitted parameters renamed as predictions, or reductions via self-citation chains. No equations, uniqueness theorems, or ansatzes are invoked that collapse the central claims back to the inputs by construction. The derivation chain consists of standard selection procedures evaluated on held-out data, making the findings self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the unproven assumption that diversity-aware selection in materials space will yield the stated performance gains on both targeted and untargeted properties in noisy data; no free parameters, new entities, or additional axioms are identifiable from the abstract alone.

axioms (1)
  • domain assumption Diversity-aware selection in materials feature space ensures broad coverage that preserves and improves performance on untargeted properties
    This assumption is invoked to explain why the framework avoids degradation on untargeted outcomes and supports future objectives.

pith-pipeline@v0.9.0 · 5559 in / 1478 out tokens · 53375 ms · 2026-05-08T16:21:41.644358+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

67 extracted references · 4 canonical work pages

  1. [1]

    J. Liu, Z. Wang, J. Kou and K. Chen,Photon Science, 2026,1, 91–103

  2. [2]

    J. S. Solomon, N. Mrkyvkova, V. Kliner, T. Soto- Montero, I. Fernandez-Guillen, M. Ledinský, P. P. Boix, P. Siffalovic and M. Morales-Masis,npj 2D Ma- terials and Applications, 2025,9, 50

  3. [3]

    G. W. Lee, A. K. Gangopadhyay, K. F. Kelton, R. W. Hyers, T. J. Rathz, J. R. Rogers and D. S. Robinson, Physical review letters, 2004,93, 037802

  4. [4]

    Mascagni, N

    M. Mascagni, N. Tchipev, S. Seckler, M. Heinen, J. Vrabec, F. Gratl, M. Horsch, M. Bern- reuther, C. W. Glass, C. Niethammer, N. Hammer, B. Krischok, M. Resch, D. Kranzlmüller, H. Hasse, H.-J. Bungartz and P. Neumann,Int. J. High Per- form. Comput. Appl., 2019,33, 838–854

  5. [5]

    KADAU, T

    K. KADAU, T. C. GERMANN and P. S. LOMDAHL, International Journal of Modern Physics C, 2006,17, 1755–1761

  6. [6]

    Wines and K

    D. Wines and K. Choudhary,Materials Futures, 2024, 3, 025602

  7. [8]

    A. Jain, S. P. Ong, G. Hautier, W. Chen, W. D. Richards, S. Dacek, S. Cholia, D. Gunter, D. Skinner, G. Ceder and K. A. Persson,APL Materials, 2013,1, 011002

  8. [9]

    Choudhary, K

    K. Choudhary, K. F. Garrity, A. C. E. Reid, B. De- Cost, A. J. Biacchi, A. R. Hight Walker, Z. Trautt, J. Hattrick-Simpers, A. G. Kusne, A. Centrone, A. Davydov, J. Jiang, R. Pachter, G. Cheon, E. Reed, A. Agrawal, X. Qian, V. Sharma, H. Zhuang, S. V. Kalinin, B. G. Sumpter, G. Pilania, P. Acar, S. Man- dal, K.Haule, D.Vanderbilt, K.RabeandF.Tavazza, npj ...

  9. [10]

    Kirklin, J

    S. Kirklin, J. E. Saal, B. Meredig, A. Thompson, J. W. Doak, M. Aykol, S. Rühl and C. Wolverton, npj Computational Materials, 2015,1, 15010

  10. [11]

    J. Shen, S. D. Griesemer, A. Gopakumar, B. Baldas- sarri, J. E. Saal, M. Aykol, V. I. Hegde and C. Wolver- ton,Journal of Physics: Materials, 2022,5, 031001

  11. [12]

    J. E. Saal, S. Kirklin, M. Aykol, B. Meredig and C. Wolverton,JOM, 2013,65, 1501–1509

  12. [13]

    Divilov, H

    S. Divilov, H. Eckert, S. D. Thiel, S. D. Griesemer, R. Friedrich, N. H. Anderson, M. J. Mehl, D. Hicks, M. Esters, N. Hotz, X. Campilongo, A. Calzolari and S. Curtarolo,High Entropy Alloys & Materials, 2025, 3, 178–187

  13. [14]

    C. Oses, M. Esters, D. Hicks, S. Divilov, H. Eckert, R. Friedrich, M. J. Mehl, A. Smolyanyuk, X. Campi- longo, A. van de Walle, J. Schroers, A. G. Kusne, I. Takeuchi, E. Zurek, M. B. Nardelli, M. Fornari, Y. Lederer, O. Levy, C. Toher and S. Curtarolo,Com- putational Materials Science, 2023,217, 111889

  14. [15]

    Zagorac, H

    D. Zagorac, H. Müller, S. Ruehl, J. Zagorac and S. Rehme,Journal of Applied Crystallography, 2019, 52, 918–925

  15. [16]

    R.AllmannandR.Hinek,Acta Crystallographica Sec- tion A, 2007,63, 412–417

  16. [17]

    Belsky, M

    A. Belsky, M. Hellenbrandt, V. L. Karen and P. Luksch,Acta Crystallographica Section B, 2002, 58, 364–369

  17. [18]

    J.-W. Lee, W. B. Park, J. H. Lee, S. P. Singh and K.-S. Sohn,Nature Communications, 2020,11, 86

  18. [19]

    B. Cao, Z. Zheng, Y. Liu, L. Zhang, L. W.-Y. Wong, L.-T. Weng, J. Li, H. Li and T.-Y. Zhang,National Science Review, 2025,12, nwaf421

  19. [20]

    Xie and J

    T. Xie and J. C. Grossman,Phys. Rev. Lett., 2018, 120, 145301

  20. [21]

    Choudhary and B

    K. Choudhary and B. DeCost,npj Computational Materials, 2021,7, 185

  21. [22]

    R. E. A. Goodall and A. A. Lee,Nature Communica- tions, 2020,11, 6280

  22. [23]

    D. Jha, L. Ward, A. Paul, W.-k. Liao, A. Choud- hary, C. Wolverton and A. Agrawal,Scientific Re- ports, 2018,8, 17593. 16

  23. [24]

    Isayev, C

    O. Isayev, C. Oses, C. Toher, E. Gossett, S. Curtarolo and A. Tropsha,Nature Communications, 2017,8, 15679

  24. [25]

    Y. Dan, Y. Zhao, X. Li, S. Li, M. Hu and J. Hu,npj Computational Materials, 2020,6, 84

  25. [26]

    J. Noh, J. Kim, H. S. Stein, B. Sanchez-Lengeling, J. M. Gregoire, A. Aspuru-Guzik and Y. Jung,Mat- ter, 2019,1, 1370–1384

  26. [27]

    H. Xiao, R. Li, X. Shi, Y. Chen, L. Zhu, X. Chen and L. Wang,Nature Communications, 2023,14, 7027

  27. [28]

    C. Zeni, R. Pinsler, D. Zügner, A. Fowler, M. Horton, X. Fu, Z. Wang, A. Shysheya, J. Crabbé, S. Ueda, R. Sordillo, L. Sun, J. Smith, B. Nguyen, H. Schulz, S. Lewis, C.-W. Huang, Z. Lu, Y. Zhou, H. Yang, H. Hao, J. Li, C. Yang, W. Li, R. Tomioka and T. Xie, Nature, 2025,639, 624–632

  28. [29]

    L. Ward, A. Agrawal, A. Choudhary and C. Wolver- ton,npj Computational Materials, 2016,2, 16028

  29. [30]

    A. G. Kusne, H. Yu, C. Wu, H. Zhang, J. Hattrick- Simpers, B. DeCost, S. Sarker, C. Oses, C. Toher, S. Curtarolo, A. V. Davydov, R. Agarwal, L. A. Ben- dersky, M.Li, A.MehtaandI.Takeuchi,Nature Com- munications, 2020,11, 5966

  30. [31]

    N. J. Szymanski, B. Rendy, Y. Fei, R. E. Kumar, T. He, D. Milsted, M. J. McDermott, M. Gallant, E. D. Cubuk, A. Merchant, H. Kim, A. Jain, C. J. Bartel, K. Persson, Y. Zeng and G. Ceder,Nature, 2023,624, 86–91

  31. [32]

    E. A. Pogue, A. New, K. McElroy, N. Q. Le, M. J. Pekala, I. McCue, E. Gienger, J. Domenico, E. Hedrick, T. M. McQueen, B. Wilfong, C. D. Pi- atko, C. R. Ratto, A. Lennon, C. Chung, T. Montal- bano, G. Bassen and C. D. Stiles,npj Computational Materials, 2023,9, 181

  32. [33]

    Palizhati, S

    A. Palizhati, S. B. Torrisi, M. Aykol, S. K. Suram, J. S. Hummelshøj and J. H. Montoya,Scientific Re- ports, 2022,12, 4694

  33. [34]

    D. Xue, D. Xue, R. Yuan, Y. Zhou, P. V. Balachan- dran, X. Ding, J. Sun and T. Lookman,Acta Materi- alia, 2017,125, 532–541

  34. [35]

    D. Xue, P. V. Balachandran, J. Hogden, J. Theiler, D. Xue and T. Lookman,Nature Communications, 2016,7, 11241

  35. [36]

    D.Xue, P.V.Balachandran, R.Yuan, T.Hu, X.Qian, E. R. Dougherty and T. Lookman,Proceedings of the National Academy of Sciences, 2016,113, 13301– 13306

  36. [37]

    H. Wang, R. E. Castañeda, J. R. Werber, Y. Fehlis, E. Kim and J. Hattrick-Simpers,Training-Free Ac- tive Learning Framework in Materials Science with Large Language Models, 2025,https://arxiv.org/ abs/2511.19730

  37. [38]

    R. Xin, E. M. D. Siriwardane, Y. Song, Y. Zhao, S.-Y. Louis, A. Nasiri and J. Hu,The Journal of Physical Chemistry C, 2021,125, 16118–16128

  38. [39]

    Rebuffi, S

    L. Rebuffi, S. Kandel, X. Shi, R. Zhang, R. J. Harder, W. Cha, M. J. Highland, M. G. Frith, L. Assoufid and M. J. Cherukara,Opt. Express, 2023,31, 39514– 39527

  39. [40]

    K. Li, D. Persaud, K. Choudhary, B. DeCost, M. Greenwood and J. Hattrick-Simpers,Nature Com- munications, 2023,14, 7283

  40. [41]

    Chanussot, A

    L. Chanussot, A. Das, S. Goyal, T. Lavril, M. Shuaibi, M. Riviere, K. Tran, J. Heras-Domingo, C. Ho, W. Hu, A. Palizhati, A. Sriram, B. Wood, J. Yoon, D. Parikh, C. L. Zitnick and Z. Ulissi,ACS Catalysis, 2021,11, 13062–13065

  41. [42]

    R. Tran, J. Lan, M. Shuaibi, B. M. Wood, S. Goyal, A. Das, J. Heras-Domingo, A. Kolluru, A. Rizvi, N. Shoghi, A. Sriram, F. Therrien, J. Abed, O. Voznyy, E. H. Sargent, Z. Ulissi and C. L. Zitnick, ACS Catalysis, 2023,13, 3066–3084

  42. [43]

    L. Ward, R. Liu, A. Krishna, V. I. Hegde, A. Agrawal, A. Choudhary and C. Wolverton,Phys. Rev. B, 2017, 96, 024104

  43. [44]

    L. Ward, A. Dunn, A. Faghaninia, N. E. Zimmer- mann, S. Bajaj, Q. Wang, J. Montoya, J. Chen, K. Bystrom, M. Dylla, K. Chard, M. Asta, K. A. Persson, G. J. Snyder, I. Foster and A. Jain,Compu- tational Materials Science, 2018,152, 60–69

  44. [45]

    L. Tang, L. Purdy, T. Mohanty, L. Ng and T. Sparks, Systematically Verified Experimental Thermoelectric Dataset For Data-driven Approaches, 2025

  45. [46]

    Grinsztajn, E

    L. Grinsztajn, E. Oyallon and G. Varoquaux, Pro- ceedings of the 36th International Conference on Neu- ral Information Processing Systems, Red Hook, NY, USA, 2022

  46. [47]

    K. Deb, A. Pratap, S. Agarwal and T. Meyarivan, IEEE Transactions on Evolutionary Computation, 2002,6, 182–197

  47. [48]

    Eriksson, M

    D. Eriksson, M. Pearce, J. R. Gardner, R. Turner and M. Poloczek,Scalable Global Optimization via Local Bayesian Optimization, 2020,https://arxiv.org/ abs/1910.01739. 17

  48. [49]

    N. Maus, K. Wu, D. Eriksson and J. Gardner, Discovering Many Diverse Solutions with Bayesian Optimization, 2023,https://arxiv.org/abs/2210. 10953

  49. [50]

    H. P. Vanchinathan, A. Marfurt, C.-A. Robelin, D. Kossmann and A. Krause, Proceedings of the 21th ACM SIGKDD International Conference on Knowl- edge Discovery and Data Mining, New York, NY, USA, 2015, p. 1195–1204

  50. [51]

    Nguyen and A

    Q. Nguyen and A. B. Dieng,Quality-Weighted Vendi Scores And Their Application To Diverse Experimen- tal Design, 2024,https://arxiv.org/abs/2405. 02449

  51. [52]

    Malkomes, B

    G. Malkomes, B. Cheng, E. H. Lee and M. Mccourt, Proceedings of the 38th International Conference on Machine Learning, 2021, pp. 7423–7434

  52. [53]

    E. Nava, M. Mutný and A. Krause,Diversified Sam- pling for Batched Bayesian Optimization with De- terminantal Point Processes, 2022,https://arxiv. org/abs/2110.11665

  53. [54]

    C. Gong, J. Peng and Q. Liu, Proceedings of the 36th International Conference on Machine Learning, 2019, pp. 2347–2356

  54. [55]

    E. Nava, M. Mutny and A. Krause, Proceedings of The 25th International Conference on Artificial Intel- ligence and Statistics, 2022, pp. 7031–7054

  55. [56]

    Kathuria, A

    T. Kathuria, A. Deshpande and P. Kohli, Proceedings of the 30th International Conference on Neural In- formation Processing Systems, Red Hook, NY, USA, 2016, p. 4213–4221

  56. [57]

    Shibukawa, S

    R. Shibukawa, S. Matsuda, K. Nakamura, R. Tamura and K. Tsuda,npj Computational Materials, 2026

  57. [58]

    Couperthwaite, A

    R. Couperthwaite, A. Molkeri, D. Khatamsaz, A. Sri- vastava, D. Allaire and R. Arròyave,JOM, 2020,72, 4431–4443

  58. [59]

    Wilson, D

    N. Wilson, D. Willhelm, X. Qian, R. Arróyave and X. Qian,Computational Materials Science, 2022, 208, 111330

  59. [60]

    Hastings, M

    T. Hastings, M. Mulukutla, D. Khatamsaz, D. Salas, W. Xu, D. Lewis, N. Person, M. Skokan, B. Miller, J. Paramore, B. Butler, D. Allaire, V. Attari, I. Kara- man, G. Pharr, A. Srivastava and R. Arróyave,Acta Materialia, 2025,297, 121173

  60. [61]

    S. M. A. A. Alvi, B. Vela, V. Attari, J. Janssen, D. Perez, D. Allaire and R. Arróyave,npj Compu- tational Materials, 2026,12, 105

  61. [62]

    Wei and E.-c

    L.-s. Wei and E.-c. Li,Journal of Computational De- sign and Engineering, 2023,10, 1988–2018

  62. [63]

    M. Li, S. Yang and X. Liu,Artificial Intelligence, 2015,228, 45–65

  63. [64]

    The vendi score: A diversity evaluation metric for machine learning.arXiv preprint arXiv:2210.02410, 2022

    D. Friedman and A. B. Dieng,The Vendi Score: A Diversity Evaluation Metric for Machine Learning, 2023,https://arxiv.org/abs/2210.02410

  64. [65]

    K. S. Beyer, J. Goldstein, R. Ramakrishnan and U. Shaft, Proceedings of the 7th International Con- ference on Database Theory, Berlin, Heidelberg, 1999, p. 217–235

  65. [66]

    C. C. Aggarwal, A. Hinneburg and D. A. Keim, Proceedings of the 8th International Conference on Database Theory, Berlin, Heidelberg, 2001, p. 420–434

  66. [67]

    Radovanović, A

    M. Radovanović, A. Nanopoulos and M. Ivanović, Proceedings of the 26th Annual International Con- ference on Machine Learning, New York, NY, USA, 2009, p. 865–872

  67. [68]

    B. T. Mamillapalli, M. R. Kochi and S. M. Moosavi, AI for Accelerated Materials Design - ICLR 2026, 2026. 18 A Appendix Figure 10:Pearson correlation of outcomes variables of DFT datasets. Figure 11:Pearson correlation of outcomes variables of the sysTEm experimental dataset. 19 Figure 12:Outcomes distribution shown in the first two PCs of feature space. ...