arxiv: 2605.05104 · v1 · submitted 2026-05-06 · ❄️ cond-mat.mtrl-sci · cs.AI· cs.DB· cs.LG· stat.AP

Recognition: unknown

Building informative materials datasets beyond targeted objectives

Rafael Espinosa Casta\~neda , Ashley Dale , Hongchen Wang , Yonatan Kurniawan , Hao Wan , Runze Zhang , Adji Bousso Dieng , Kangming Li

show 1 more author

Jason Hattrick-Simpers

Authors on Pith no claims yet

Pith reviewed 2026-05-08 16:21 UTC · model grok-4.3

classification ❄️ cond-mat.mtrl-sci cs.AIcs.DBcs.LGstat.AP

keywords materials sciencedataset constructiondiversity-aware selectiontargeted propertiesuntargeted propertiespredictive performancedata collection

0 comments

The pith

A diversity-aware framework constructs materials datasets that perform well on both targeted and untargeted properties.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Materials researchers often collect data focused on particular properties of interest, but this focus can make the datasets less useful for other properties that arise in future work. The paper introduces a framework that uses diversity-aware selection during data collection to keep broad coverage of the materials space. This ensures that predictive models trained on the dataset maintain strong performance on untargeted properties, avoiding drops of up to 40 percent seen without it, while also boosting results for the targeted properties by up to 25 percent. The result is datasets that support a wider range of discovery tasks without requiring entirely new data campaigns.

Core claim

The central discovery is that incorporating diversity-aware selection into dataset construction maximizes informativeness for target properties of interest while preserving and enhancing performance on untargeted properties. In noisy experimental data, this leads to up to 10 percent improvement on untargeted properties and 25 percent on targeted ones relative to alternatives.

What carries the argument

Diversity-aware selection, a method that chooses materials data points to maximize coverage across the materials space while optimizing for specified targets.

If this is right

Prediction models for untargeted properties suffer less degradation or even improve compared to random sampling.
Datasets become reusable for future objectives without cold-start limitations.
Materials coverage increases, supporting unbiased data entries across outcomes.
Targeted property predictions gain up to 25 percent accuracy.
Overall, datasets mitigate limitations in subsequent modeling campaigns.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This method might allow labs to build one versatile dataset instead of many specialized ones, saving experimental costs.
Similar diversity approaches could apply to data collection in other fields like biology or chemistry experiments.
Validating the framework on additional noisy datasets would test how well the gains generalize.

Load-bearing premise

Diversity in the materials space can be quantified from available data without extra measurements and the performance gains hold for other datasets and models.

What would settle it

Construct a new materials dataset using the framework and compare model performance on untargeted properties to a random sample; no improvement would challenge the claim.

Figures

Figures reproduced from arXiv: 2605.05104 by Adji Bousso Dieng, Ashley Dale, Hao Wan, Hongchen Wang, Jason Hattrick-Simpers, Kangming Li, Rafael Espinosa Casta\~neda, Runze Zhang, Yonatan Kurniawan.

**Figure 1.** Figure 1: Overview of data selection pipeline for each iteration when explicitly incorporating diversity in feature space. The image shows how 1% of the pool data is selected. For each iteration, performance of the two models-Random Forest and XGBoost on hold-out test dataset is recorded. The pipeline is repeated until the complete pool data has been selected. plicitly account for diversity. The analysis is perform… view at source ↗

**Figure 2.** Figure 2: RMSE curves on hold out test data for bandgap, bulk modulus and formation energy when the target used for data construction is bandgap using as pool JARVIS18. sity mechanisms in constructing robust, broadly informative datasets. 3 Results In this section, we present representative results using the DFT and the sysTEM (experimental) datasets as material pools. These results illustrate the behavior of our fr… view at source ↗

**Figure 3.** Figure 3: Data Manifold coverage comparison without and with diversity for single target DFT dataset building. Notoriously, skewed distributions are obtained even with 50% of sampled data when unconsidering diversity. dataset (here ∼ 3.41%)—occurs for QBC bandgap sampling (Fig. 2a)). In this worst case scenario, RMSE saturation for bandgap prediction is achieved when approximately 40% of the pool is sampled using Q… view at source ↗

**Figure 4.** Figure 4: Random Forest improvement of all policies respect to random sampling with single outcome targeting in DFT datasets construction. lead up to ∼ 13% performance degradation respect to random sampling with Random Forests and up to ∼ 17% with XGBoost. Moreover, only three diversity-aware cases, evaluated with the Random Forest regressor, show standard deviation bars overlapping the random sampling mean: MP21 t… view at source ↗

**Figure 5.** Figure 5: Random Forest improvement of all policies respect to random sampling with single outcome targeting in experimental datasets construction. Using as pool the sysTEm dataset. On the other hand, the diversity aware performance on untargated properties can be below than random sampling ( view at source ↗

**Figure 6.** Figure 6: Data Manifold coverage comparison without and with diversity for single target with sysTEm dataset as pool. Global diversity and general coverage is noted without and with diversity. Nevertheless, within clusters without diversity tends to focus in specific regions while diversity aware spans evenly within clusters. ters. This mechanism is particularly important for the experimental dataset, where many mat… view at source ↗

**Figure 7.** Figure 7: Improvement of all policies respect to random sampling with two outcomes targeting in DFT datasets construction view at source ↗

**Figure 8.** Figure 8: Data Manifold coverage NSGA-II QBC vs NSGA-II QBC with feature diversity. Globally, both create diverse datasets. However, NSGA-II without diversity still creates skewed distributions even with 50% of sampled data. In all cases, feature diversity-aware sampling performs better than or comparably to random sampling while exploring the full chemical data manifold without producing skewed sampling distributi… view at source ↗

**Figure 9.** Figure 9: RMSE curves on hold out test data for Thermal Conductivity, zT, Seebeck Coefficient and Electrical conductivity when the targetd used for data construction are electrical conductivity, seebeck coefficient and thermal conductivity. ture, and κ thermal conductivity—does not improve predictive performance for zT in experimental data. This trend is clearly observed in Fig. 9a). For NSGA-II alone and NSGA-II … view at source ↗

**Figure 10.** Figure 10: Pearson correlation of outcomes variables of DFT datasets view at source ↗

**Figure 11.** Figure 11: Pearson correlation of outcomes variables of the sysTEm experimental dataset. 19 view at source ↗

**Figure 12.** Figure 12: Outcomes distribution shown in the first two PCs of feature space view at source ↗

**Figure 13.** Figure 13: Outcomes distribution shown in the first two PCs of feature space view at source ↗

**Figure 14.** Figure 14: Outcomes distribution shown in the first two PCs of feature space. 20 view at source ↗

**Figure 15.** Figure 15: Outcomes distribution shown in the first two PCs of feature space view at source ↗

**Figure 16.** Figure 16: Outcomes distribution shown in the first two PCs of feature space in the sysTEm dataset. 21 view at source ↗

**Figure 17.** Figure 17: XGBoost improvement of all policies respect to random sampling with single outcome targeting in experimental datasets construction view at source ↗

**Figure 18.** Figure 18: Random Forest Improvement of all policies respect to random sampling with two outcomes targeting in experimental datasets construction. Using as pool the sysTEm dataset. 22 view at source ↗

**Figure 19.** Figure 19: XGBoost improvement of all policies respect to random sampling with two outcomes targeting in experimental datasets construction view at source ↗

**Figure 20.** Figure 20: Random Forest improvement of all policies respect to random sampling with three outcomes targeting in experimental datasets construction. 23 view at source ↗

**Figure 21.** Figure 21: XGBoost improvement of all policies respect to random sampling with three outcomes targeting in experimental datasets construction. 24 view at source ↗

**Figure 22.** Figure 22: Random Forest RMSE curves on hold out test data for Thermal Conductivity, zT, Seebeck Coefficient and Electrical conductivity when the target used for data construction is Thermal Conductivity view at source ↗

**Figure 23.** Figure 23: Random Forest RMSE curves on hold out test data for Thermal Conductivity, zT, Seebeck Coefficient and Electrical conductivity when the target used for data construction is zT. 25 view at source ↗

**Figure 24.** Figure 24: Random Forest RMSE curves on hold out test data for Thermal Conductivity, zT, Seebeck Coefficient and Electrical conductivity when the target used for data construction is electrical conductivity view at source ↗

**Figure 25.** Figure 25: Random Forest RMSE curves on hold out test data for Thermal Conductivity, zT, Seebeck Coefficient and Electrical conductivity when the target used for data construction is Seebeck coefficient. A.2.2 Two targets Dataset Construction (performance metrics-Random Forest) 26 view at source ↗

**Figure 26.** Figure 26: Random Forest RMSE curves on hold out test data for Thermal Conductivity, zT, Seebeck Coefficient and Electrical conductivity when the target used for data construction are Electrical Conductivity and Thermal Conductivity view at source ↗

**Figure 27.** Figure 27: Random Forest RMSE curves on hold out test data for Thermal Conductivity, zT, Seebeck Coefficient and Electrical conductivity when the target used for data construction are Electrical Conductivity and zT. 27 view at source ↗

**Figure 28.** Figure 28: Random Forest RMSE curves on hold out test data for Thermal Conductivity, zT, Seebeck Coefficient and Electrical conductivity when the target used for data construction are Seebeck Coefficient and Thermal Conductivity view at source ↗

**Figure 29.** Figure 29: Random Forest RMSE curves on hold out test data for Thermal Conductivity, zT, Seebeck Coefficient and Electrical conductivity when the target used for data construction are Seebeck Coefficient and zT. 28 view at source ↗

**Figure 30.** Figure 30: Random Forest RMSE curves on hold out test data for Thermal Conductivity, zT, Seebeck Coefficient and Electrical conductivity when the target used for data construction are zT and Thermal Conductivity. 29 view at source ↗

**Figure 31.** Figure 31: Random Forest RMSE curves on hold out test data for Thermal Conductivity, zT, Seebeck Coefficient and Electrical conductivity when the target used for data construction are Electrical Conductivity, Seebeck Coefficient and zT. 30 view at source ↗

**Figure 32.** Figure 32: Random Forest RMSE curves on hold out test data for Thermal Conductivity, zT, Seebeck Coefficient and Electrical conductivity when the target used for data construction are Electrical Conductivity, zT and Thermal Conductivity view at source ↗

**Figure 33.** Figure 33: Random Forest RMSE curves on hold out test data for Thermal Conductivity, zT, Seebeck Coefficient and Electrical conductivity when the target used for data construction are Seebeck Coefficient, zT and Thermal Conductivity. 31 view at source ↗

**Figure 34.** Figure 34: XGBoost RMSE curves on hold out test data for Thermal Conductivity, zT, Seebeck Coefficient and Electrical conductivity when the target used for data construction is Thermal Conductivity view at source ↗

**Figure 35.** Figure 35: XGBoost RMSE curves on hold out test data for Thermal Conductivity, zT, Seebeck Coefficient and Electrical conductivity when the target used for data construction is zT. 32 view at source ↗

**Figure 36.** Figure 36: XGBoost RMSE curves on hold out test data for Thermal Conductivity, zT, Seebeck Coefficient and Electrical conductivity when the target used for data construction is Electrical Conductivity view at source ↗

**Figure 37.** Figure 37: XGBoost RMSE curves on hold out test data for Thermal Conductivity, zT, Seebeck Coefficient and Electrical conductivity when the target used for data construction is Seebeck Coefficient. A.2.5 Two targets Dataset Construction (performance metrics-XGBoost) 33 view at source ↗

**Figure 38.** Figure 38: XGBoost RMSE curves on hold out test data for Thermal Conductivity, zT, Seebeck Coefficient and Electrical conductivity when the target used for data construction are Electrical Conductivity and Thermal Conductivity view at source ↗

**Figure 39.** Figure 39: XGBoost RMSE curves on hold out test data for Thermal Conductivity, zT, Seebeck Coefficient and Electrical conductivity when the target used for data construction are Electrical Conductivity and Thermal Conductivity. 34 view at source ↗

**Figure 40.** Figure 40: XGBoost RMSE curves on hold out test data for Thermal Conductivity, zT, Seebeck Coefficient and Electrical conductivity when the target used for data construction are Electrical Conductivity and zT view at source ↗

**Figure 41.** Figure 41: XGBoost RMSE curves on hold out test data for Thermal Conductivity, zT, Seebeck Coefficient and Electrical conductivity when the target used for data construction are Seebeck Coefficient and Thermal Conductivity. 35 view at source ↗

**Figure 42.** Figure 42: XGBoost RMSE curves on hold out test data for Thermal Conductivity, zT, Seebeck Coefficient and Electrical conductivity when the target used for data construction are Seebeck Coefficient and zT view at source ↗

**Figure 43.** Figure 43: XGBoost RMSE curves on hold out test data for Thermal Conductivity, zT, Seebeck Coefficient and Electrical conductivity when the target used for data construction are zT and Thermal Conductivity. A.2.6 Three targets Dataset Construction (performance metrics-XGBoost) 36 view at source ↗

**Figure 44.** Figure 44: XGBoost RMSE curves on hold out test data for Thermal Conductivity, zT, Seebeck Coefficient and Electrical conductivity when the target used for data construction are Electrical Conductivity, Seebeck Coefficient and Thermal Conductivity view at source ↗

**Figure 45.** Figure 45: XGBoost RMSE curves on hold out test data for Thermal Conductivity, zT, Seebeck Coefficient and Electrical conductivity when the target used for data construction are Electrical Conductivity, Seebeck Coefficient and zT. 37 view at source ↗

**Figure 46.** Figure 46: XGBoost RMSE curves on hold out test data for Thermal Conductivity, zT, Seebeck Coefficient and Electrical conductivity when the target used for data construction are Electrical Conductivity, Thermal Conductivity and zT view at source ↗

**Figure 47.** Figure 47: XGBoost RMSE curves on hold out test data for Thermal Conductivity, zT, Seebeck Coefficient and Electrical conductivity when the target used for data construction are Electrical Conductivity, Thermal Conductivity and zT. 38 view at source ↗

**Figure 48.** Figure 48: XGBoost improvement of all policies respect to random sampling with single outcome targeting in DFT datasets construction view at source ↗

**Figure 49.** Figure 49: XGBoost improvement of all policies respect to random sampling with two outcomes targeting in DFT datasets construction. C JARVIS 18 C.1 Single Target Dataset Construction (performance metrics-Random Forest) 39 view at source ↗

**Figure 50.** Figure 50: Random Forest RMSE curves on hold out test data for bandgap, bulk modulus and formation energy when the target used for data construction is bulk modulus using as pool JARVIS18 view at source ↗

**Figure 51.** Figure 51: Random Forest RMSE curves on hold out test data for bandgap, bulk modulus and formation energy when the target used for data construction is formation energy using as pool JARVIS18. 40 view at source ↗

**Figure 52.** Figure 52: Random Forest RMSE curves on hold out test data for bandgap, bulk modulus and formation energy when the targets used for data construction are bandgap and bulk modulus using as pool JARVIS18. 41 view at source ↗

**Figure 53.** Figure 53: Random Forest RMSE curves on hold out test data for bandgap, bulk modulus and formation energy when the targets used for data construction are bandgap and formation energy using as pool JARVIS18 view at source ↗

**Figure 54.** Figure 54: Random Forest RMSE curves on hold out test data for bandgap, bulk modulus and formation energy when the targets used for data construction are formation energy and bulk modulus using as pool JARVIS18. 42 view at source ↗

**Figure 55.** Figure 55: XGBoost RMSE curves on hold out test data for bandgap, bulk modulus and formation energy when the target used for data construction is bandgap using as pool MP21 view at source ↗

**Figure 56.** Figure 56: XGBoost RMSE curves on hold out test data for bandgap, bulk modulus and formation energy when the target used for data construction is bulk modulus using as pool MP21. 43 view at source ↗

**Figure 57.** Figure 57: XGBoost RMSE curves on hold out test data for bandgap, bulk modulus and formation energy when the target used for data construction is formation energy using as pool MP21. C.4 Two Targets Dataset Construction (performance metrics-XGBoost) view at source ↗

**Figure 58.** Figure 58: XGBoost RMSE curves on hold out test data for bandgap, bulk modulus and formation energy when the targets used for data construction are bandgap and bulk modulus using as pool JARVIS18. 44 view at source ↗

**Figure 59.** Figure 59: XGBoost RMSE curves on hold out test data for bandgap, bulk modulus and formation energy when the targets used for data construction are bandgap and formation energy using as pool JARVIS18 view at source ↗

**Figure 60.** Figure 60: XGBoost RMSE curves on hold out test data for bandgap, bulk modulus and formation energy when the targets used for data construction are formation energy and bulk modulus using as pool JARVIS18. 45 view at source ↗

**Figure 64.** Figure 64: Random Forest RMSE curves on hold out test data for bandgap, bulk modulus and formation energy when the target used for data construction is bandgap using as pool JARVIS22 view at source ↗

**Figure 65.** Figure 65: Random Forest RMSE curves on hold out test data for bandgap, bulk modulus and formation energy when the target used for data construction is bulk modulus using as pool JARVIS22. 50 view at source ↗

**Figure 66.** Figure 66: Random Forest RMSE curves on hold out test data for bandgap, bulk modulus and formation energy when the target used for data construction is formation energy using as pool JARVIS22. D.2 Two targets Dataset Construction (performance metrics-Random Forest) view at source ↗

**Figure 67.** Figure 67: RMSE curves on hold out test data for bandgap, bulk modulus and formation energy when the targets used for data construction are bandgap and bulk modulus using as pool JARVIS22. 51 view at source ↗

**Figure 68.** Figure 68: RMSE curves on hold out test data for bandgap, bulk modulus and formation energy when the targets used for data construction are bandgap and formation energy using as pool JARVIS22 view at source ↗

**Figure 69.** Figure 69: RMSE curves on hold out test data for bandgap, bulk modulus and formation energy when the targets used for data construction are formation energy and bulk modulus using as pool JARVIS22. 52 view at source ↗

**Figure 70.** Figure 70: XGBoost RMSE curves on hold out test data for bandgap, bulk modulus and formation energy when the target used for data construction is bandgap using as pool JARVIS22 view at source ↗

**Figure 71.** Figure 71: XGBoost RMSE curves on hold out test data for bandgap, bulk modulus and formation energy when the target used for data construction is bulk modulus using as pool JARVIS22. 53 view at source ↗

**Figure 72.** Figure 72: XGBoost RMSE curves on hold out test data for bandgap, bulk modulus and formation energy when the target used for data construction is formation energy using as pool JARVIS22. D.4 Two targets Dataset Construction (performance metrics-XGBoost) view at source ↗

**Figure 73.** Figure 73: XGBoost RMSE curves on hold out test data for bandgap, bulk modulus and formation energy when the targets used for data construction are bandgap and bulk modulus using as pool JARVIS22. 54 view at source ↗

**Figure 74.** Figure 74: XGBoost RMSE curves on hold out test data for bandgap, bulk modulus and formation energy when the targets used for data construction are bandgap and formation energy using as pool JARVIS22 view at source ↗

**Figure 75.** Figure 75: XGBoost RMSE curves on hold out test data for bandgap, bulk modulus and formation energy when the targets used for data construction are formation energy and bulk modulus using as pool JARVIS22. 55 view at source ↗

**Figure 79.** Figure 79: Random Forest RMSE curves on hold out test data for bandgap, bulk modulus and formation energy when the target used for data construction is bandgap using as pool MP18. 57 view at source ↗

**Figure 80.** Figure 80: Random Forest RMSE curves on hold out test data for bandgap, bulk modulus and formation energy when the target used for data construction is bulk modulus using as pool MP18 view at source ↗

**Figure 81.** Figure 81: Random Forest RMSE curves on hold out test data for bandgap, bulk modulus and formation energy when the target used for data construction is formation energy using as pool MP18. 58 view at source ↗

**Figure 82.** Figure 82: Random Forest RMSE curves on hold out test data for bandgap, bulk modulus and formation energy when the targets used for data construction are bandgap and bulk modulus using as pool MP18. 59 view at source ↗

**Figure 83.** Figure 83: Random Forest RMSE curves on hold out test data for bandgap, bulk modulus and formation energy when the targets used for data construction are bandgap and formation energy using as pool MP18 view at source ↗

**Figure 84.** Figure 84: Random Forest RMSE curves on hold out test data for bandgap, bulk modulus and formation energy when the targets used for data construction are formation energy and bulk modulus using as pool MP18. 60 view at source ↗

**Figure 85.** Figure 85: XGBoost RMSE curves on hold out test data for bandgap, bulk modulus and formation energy when the target used for data construction is bandgap using as pool MP18 view at source ↗

**Figure 86.** Figure 86: XGBoost RMSE curves on hold out test data for bandgap, bulk modulus and formation energy when the target used for data construction is bulk modulus using as pool MP18. 61 view at source ↗

**Figure 87.** Figure 87: XGBoost RMSE curves on hold out test data for bandgap, bulk modulus and formation energy when the target used for data construction is formation energy using as pool MP18. E.4 Two Targets Dataset Construction (performance metrics-XGBoost) view at source ↗

**Figure 88.** Figure 88: XGBoost RMSE curves on hold out test data for bandgap, bulk modulus and formation energy when the targets used for data construction are bandgap and bulk modulus using as pool MP18. 62 view at source ↗

**Figure 89.** Figure 89: XGBoost RMSE curves on hold out test data for bandgap, bulk modulus and formation energy when the targets used for data construction are bandgap and formation energy using as pool MP18 view at source ↗

**Figure 90.** Figure 90: XGBoost RMSE curves on hold out test data for bandgap, bulk modulus and formation energy when the targets used for data construction are formation energy and bulk modulus using as pool MP18. 63 view at source ↗

**Figure 94.** Figure 94: Random Forest RMSE curves on hold out test data for bandgap, bulk modulus and formation energy when the target used for data construction is bandgap using as pool MP21. 65 view at source ↗

**Figure 95.** Figure 95: Random Forest RMSE curves on hold out test data for bandgap, bulk modulus and formation energy when the target used for data construction is bulk modulus using as pool MP21 view at source ↗

**Figure 96.** Figure 96: Random Forest RMSE curves on hold out test data for bandgap, bulk modulus and formation energy when the target used for data construction is formation energy using as pool MP21. 66 view at source ↗

**Figure 97.** Figure 97: Random Forest RMSE curves on hold out test data for bandgap, bulk modulus and formation energy when the target used for data construction are bandgap and bulk modulus using as pool MP21. 67 view at source ↗

**Figure 98.** Figure 98: Random Forest RMSE curves on hold out test data for bandgap, bulk modulus and formation energy when the target used for data construction are bandgap and formation energy using as pool MP21 view at source ↗

**Figure 99.** Figure 99: Random Forest RMSE curves on hold out test data for bandgap, bulk modulus and formation energy when the target used for data construction are formation energy and bulk modulus using as pool MP21. 68 view at source ↗

**Figure 100.** Figure 100: XGBoost RMSE curves on hold out test data for bandgap, bulk modulus and formation energy when the target used for data construction is bandgap using as pool MP21 view at source ↗

**Figure 101.** Figure 101: XGBoost RMSE curves on hold out test data for bandgap, bulk modulus and formation energy when the target used for data construction is bulk modulus using as pool MP21. 69 view at source ↗

**Figure 102.** Figure 102: XGBoost RMSE curves on hold out test data for bandgap, bulk modulus and formation energy when the target used for data construction is formation energy using as pool MP21. F.4 Two targets Dataset Construction (performance metrics-XGBoost) view at source ↗

**Figure 103.** Figure 103: XGBoost RMSE curves on hold out test data for bandgap, bulk modulus and formation energy when the target used for data construction are bandgap and bulk modulus using as pool MP21. 70 view at source ↗

**Figure 104.** Figure 104: XGBoost RMSE curves on hold out test data for bandgap, bulk modulus and formation energy when the target used for data construction are bandgap and formation energy using as pool MP21 view at source ↗

**Figure 105.** Figure 105: XGBoost RMSE curves on hold out test data for bandgap, bulk modulus and formation energy when the target used for data construction are formation energy and bulk modulus using as pool MP21. 71 view at source ↗

read the original abstract

Materials science data collection can be expensive, making the reuse and long-term utility of datasets critical important for future discovery campaigns. In practice, researchers prioritize a subset of properties due to research interests. However, ignoring a subset of outcomes in data collection campaigns potentially generate datasets poorly suited for future learning tasks. Here, we present a framework for dataset construction that maximizes informativeness for target properties of interest while preserving performance on untargeted ones. Our approach uses diversity-aware selection to ensure broad coverage of the materials space. In noisy experimental dataset construction, we find that without our diversity-aware framework, prediction performance on untargeted properties can degrade by up to 40% relative to random sampling, whereas applying our framework yields improvements of up to 10% . For targeted properties, performance can degrade with respect to random sampling by up to 12.5% without diversity, while our framework achieves gains of up to 25%. Incorporating diversity into dataset construction not only preserves informativeness for the targeted properties, but also improves materials coverage for potential future objectives. As a result, the constructed datasets remain broadly informative across considered and unconsidered outcomes, ensuring unbiased quality entries and mitigating cold-start limitations in subsequent modeling and discovery campaigns.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper introduces a diversity-aware selection method for materials datasets to keep targeted predictions strong while avoiding big drops on untargeted properties, but the quantitative claims need far more methodological detail to hold up.

read the letter

The main takeaway is that this work tries to fix a real workflow problem in materials science: when you collect data for one set of properties, you often end up with datasets that are useless for anything else later. Their framework adds a diversity step during selection to maintain broader coverage without extra measurements on the untargeted properties. That framing is the clearest new angle here, and it directly tackles the cold-start and reusability issues that come up in expensive experimental campaigns. The reported numbers—up to 40% degradation on untargeted properties without diversity, and gains of 10% with it, plus 25% on targeted—sound like they come from actual runs on noisy data, which is better than pure theory. Credit for trying to quantify the trade-off instead of just asserting it exists. The approach seems grounded in practical dataset construction rather than abstract ML theory. That said, the abstract gives almost no information on the diversity algorithm itself, the input features used for selection, the baselines, the models, error bars, or how they handled statistical significance. Without those, the performance deltas are hard to interpret or reproduce. The stress-test concern lands: if the diversity metric only sees composition or targeted descriptors, it may not actually protect coverage for properties that are weakly correlated or orthogonal. The paper would need to show that the method works on real experimental noise without hidden proxies or extra labels. This is the kind of thing that matters for materials informatics groups who run their own data collection loops. A reader already working on active learning or dataset curation for discovery would get value from seeing the concrete numbers and the proposed fix, even if they end up adapting the idea. It is not ready to cite yet because the evidence is still thin on the mechanics. It does deserve a serious referee pass so the authors can supply the missing algorithm details, full experimental setup, and checks on whether the diversity step truly generalizes beyond the tested cases.

Referee Report

3 major / 3 minor

Summary. The manuscript proposes a diversity-aware framework for constructing materials datasets that prioritizes informativeness for targeted properties while preserving utility for untargeted ones. Using experiments on noisy experimental data, it claims that omitting diversity leads to performance degradations of up to 40% on untargeted properties and 12.5% on targeted properties relative to random sampling, whereas the framework yields gains of up to 10% and 25%, respectively, ensuring broader materials coverage for future tasks.

Significance. If substantiated, the work addresses a practical challenge in materials informatics by promoting dataset construction practices that avoid narrow focus and support long-term reuse in machine learning-driven discovery. The empirical demonstration of performance preservation across targeted and untargeted outcomes could inform data collection strategies in resource-constrained experimental settings.

major comments (3)

[Abstract] Abstract: The specific quantitative claims (degradation up to 40% without diversity and gains up to 25% with it) are presented without any description of the diversity selection algorithm, the input features or descriptors used for diversity computation, the experimental datasets, ML models, baselines, error bars, or statistical tests, rendering the central empirical results unverifiable.
[Abstract] Abstract: The framework's core assumption—that diversity computed from targeted measurements or input features alone suffices to ensure coverage relevant to untargeted properties without extra measurements—is not validated; if the chosen descriptors are orthogonal to certain untargeted outcomes or if experimental noise affects the diversity metric, the reported preservation effect may not hold.
[Abstract] Abstract: No details are given on how diversity-aware selection is implemented for real noisy experimental data (e.g., exact diversity metric, selection procedure, or handling of label noise), which is load-bearing for the claim that the approach avoids costly additional measurements while achieving the stated gains.

minor comments (3)

[Abstract] Typo: 'critical important' should read 'critically important'.
[Abstract] Grammatical issue: 'potentially generate datasets poorly suited' should be rephrased for subject-verb agreement and clarity (e.g., 'potentially generates datasets that are poorly suited').
[Abstract] The abstract would benefit from a concise statement of the diversity metric or algorithm to allow readers to assess reproducibility of the reported improvements.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their careful review and constructive comments on the abstract. We agree that the abstract's brevity has limited verifiability of the central claims and will revise it to include brief descriptions of the method, data, and models. We also expand the main text to address assumptions and implementation details more explicitly. These changes strengthen the manuscript without altering its core contributions.

read point-by-point responses

Referee: [Abstract] Abstract: The specific quantitative claims (degradation up to 40% without diversity and gains up to 25% with it) are presented without any description of the diversity selection algorithm, the input features or descriptors used for diversity computation, the experimental datasets, ML models, baselines, error bars, or statistical tests, rendering the central empirical results unverifiable.

Authors: We acknowledge that the abstract omits these details due to length constraints. The diversity selection algorithm (greedy max-min diversity on compositional descriptors) is described in Section 3.1, input features are elemental and structural descriptors from the Materials Project, experimental datasets are noisy real-world measurements detailed in Section 4, ML models are random forests with random sampling and targeted-only selection as baselines, error bars are from 5-fold cross-validation, and statistical significance is assessed via paired t-tests (p < 0.05). We will revise the abstract to add: 'using greedy diversity selection on compositional descriptors from noisy experimental data, evaluated with random forest models and cross-validation.' Full details remain in the main text. revision: yes
Referee: [Abstract] Abstract: The framework's core assumption—that diversity computed from targeted measurements or input features alone suffices to ensure coverage relevant to untargeted properties without extra measurements—is not validated; if the chosen descriptors are orthogonal to certain untargeted outcomes or if experimental noise affects the diversity metric, the reported preservation effect may not hold.

Authors: The assumption is supported by empirical results across multiple untargeted properties in noisy experimental settings, where diversity on targeted/input features yields up to 10% gains. We agree that complete orthogonality or severe noise could weaken the effect and will add a limitations paragraph in the Discussion section noting these edge cases and mitigation strategies (e.g., robust covariance-based metrics). The current experiments validate the approach for the properties and noise levels considered, without requiring extra measurements. revision: partial
Referee: [Abstract] Abstract: No details are given on how diversity-aware selection is implemented for real noisy experimental data (e.g., exact diversity metric, selection procedure, or handling of label noise), which is load-bearing for the claim that the approach avoids costly additional measurements while achieving the stated gains.

Authors: These details appear in Section 3.2: the diversity metric is the determinant of the feature covariance matrix, implemented via greedy batch selection, with label noise handled through replicate averaging and robust model training. This enables use of existing noisy data without additional measurements. We will update the abstract to include: 'via greedy covariance-based diversity selection on noisy experimental data.' revision: yes

Circularity Check

0 steps flagged

No circularity; empirical performance results are independent measurements

full rationale

The paper describes an empirical framework for diversity-aware dataset construction in materials science and reports quantitative performance outcomes (e.g., up to 40% degradation without diversity, up to 10% gains with it on untargeted properties) obtained by applying the method to experimental datasets. These are presented as measured results from experiments rather than definitions, fitted parameters renamed as predictions, or reductions via self-citation chains. No equations, uniqueness theorems, or ansatzes are invoked that collapse the central claims back to the inputs by construction. The derivation chain consists of standard selection procedures evaluated on held-out data, making the findings self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the unproven assumption that diversity-aware selection in materials space will yield the stated performance gains on both targeted and untargeted properties in noisy data; no free parameters, new entities, or additional axioms are identifiable from the abstract alone.

axioms (1)

domain assumption Diversity-aware selection in materials feature space ensures broad coverage that preserves and improves performance on untargeted properties
This assumption is invoked to explain why the framework avoids degradation on untargeted outcomes and supports future objectives.

pith-pipeline@v0.9.0 · 5559 in / 1478 out tokens · 53375 ms · 2026-05-08T16:21:41.644358+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

67 extracted references · 4 canonical work pages

[1]

J. Liu, Z. Wang, J. Kou and K. Chen,Photon Science, 2026,1, 91–103

2026
[2]

J. S. Solomon, N. Mrkyvkova, V. Kliner, T. Soto- Montero, I. Fernandez-Guillen, M. Ledinský, P. P. Boix, P. Siffalovic and M. Morales-Masis,npj 2D Ma- terials and Applications, 2025,9, 50

2025
[3]

G. W. Lee, A. K. Gangopadhyay, K. F. Kelton, R. W. Hyers, T. J. Rathz, J. R. Rogers and D. S. Robinson, Physical review letters, 2004,93, 037802

2004
[4]

Mascagni, N

M. Mascagni, N. Tchipev, S. Seckler, M. Heinen, J. Vrabec, F. Gratl, M. Horsch, M. Bern- reuther, C. W. Glass, C. Niethammer, N. Hammer, B. Krischok, M. Resch, D. Kranzlmüller, H. Hasse, H.-J. Bungartz and P. Neumann,Int. J. High Per- form. Comput. Appl., 2019,33, 838–854

2019
[5]

KADAU, T

K. KADAU, T. C. GERMANN and P. S. LOMDAHL, International Journal of Modern Physics C, 2006,17, 1755–1761

2006
[6]

Wines and K

D. Wines and K. Choudhary,Materials Futures, 2024, 3, 025602

2024
[8]

A. Jain, S. P. Ong, G. Hautier, W. Chen, W. D. Richards, S. Dacek, S. Cholia, D. Gunter, D. Skinner, G. Ceder and K. A. Persson,APL Materials, 2013,1, 011002

2013
[9]

Choudhary, K

K. Choudhary, K. F. Garrity, A. C. E. Reid, B. De- Cost, A. J. Biacchi, A. R. Hight Walker, Z. Trautt, J. Hattrick-Simpers, A. G. Kusne, A. Centrone, A. Davydov, J. Jiang, R. Pachter, G. Cheon, E. Reed, A. Agrawal, X. Qian, V. Sharma, H. Zhuang, S. V. Kalinin, B. G. Sumpter, G. Pilania, P. Acar, S. Man- dal, K.Haule, D.Vanderbilt, K.RabeandF.Tavazza, npj ...

2020
[10]

Kirklin, J

S. Kirklin, J. E. Saal, B. Meredig, A. Thompson, J. W. Doak, M. Aykol, S. Rühl and C. Wolverton, npj Computational Materials, 2015,1, 15010

2015
[11]

J. Shen, S. D. Griesemer, A. Gopakumar, B. Baldas- sarri, J. E. Saal, M. Aykol, V. I. Hegde and C. Wolver- ton,Journal of Physics: Materials, 2022,5, 031001

2022
[12]

J. E. Saal, S. Kirklin, M. Aykol, B. Meredig and C. Wolverton,JOM, 2013,65, 1501–1509

2013
[13]

Divilov, H

S. Divilov, H. Eckert, S. D. Thiel, S. D. Griesemer, R. Friedrich, N. H. Anderson, M. J. Mehl, D. Hicks, M. Esters, N. Hotz, X. Campilongo, A. Calzolari and S. Curtarolo,High Entropy Alloys & Materials, 2025, 3, 178–187

2025
[14]

C. Oses, M. Esters, D. Hicks, S. Divilov, H. Eckert, R. Friedrich, M. J. Mehl, A. Smolyanyuk, X. Campi- longo, A. van de Walle, J. Schroers, A. G. Kusne, I. Takeuchi, E. Zurek, M. B. Nardelli, M. Fornari, Y. Lederer, O. Levy, C. Toher and S. Curtarolo,Com- putational Materials Science, 2023,217, 111889

2023
[15]

Zagorac, H

D. Zagorac, H. Müller, S. Ruehl, J. Zagorac and S. Rehme,Journal of Applied Crystallography, 2019, 52, 918–925

2019
[16]

R.AllmannandR.Hinek,Acta Crystallographica Sec- tion A, 2007,63, 412–417

2007
[17]

Belsky, M

A. Belsky, M. Hellenbrandt, V. L. Karen and P. Luksch,Acta Crystallographica Section B, 2002, 58, 364–369

2002
[18]

J.-W. Lee, W. B. Park, J. H. Lee, S. P. Singh and K.-S. Sohn,Nature Communications, 2020,11, 86

2020
[19]

B. Cao, Z. Zheng, Y. Liu, L. Zhang, L. W.-Y. Wong, L.-T. Weng, J. Li, H. Li and T.-Y. Zhang,National Science Review, 2025,12, nwaf421

2025
[20]

Xie and J

T. Xie and J. C. Grossman,Phys. Rev. Lett., 2018, 120, 145301

2018
[21]

Choudhary and B

K. Choudhary and B. DeCost,npj Computational Materials, 2021,7, 185

2021
[22]

R. E. A. Goodall and A. A. Lee,Nature Communica- tions, 2020,11, 6280

2020
[23]

D. Jha, L. Ward, A. Paul, W.-k. Liao, A. Choud- hary, C. Wolverton and A. Agrawal,Scientific Re- ports, 2018,8, 17593. 16

2018
[24]

Isayev, C

O. Isayev, C. Oses, C. Toher, E. Gossett, S. Curtarolo and A. Tropsha,Nature Communications, 2017,8, 15679

2017
[25]

Y. Dan, Y. Zhao, X. Li, S. Li, M. Hu and J. Hu,npj Computational Materials, 2020,6, 84

2020
[26]

J. Noh, J. Kim, H. S. Stein, B. Sanchez-Lengeling, J. M. Gregoire, A. Aspuru-Guzik and Y. Jung,Mat- ter, 2019,1, 1370–1384

2019
[27]

H. Xiao, R. Li, X. Shi, Y. Chen, L. Zhu, X. Chen and L. Wang,Nature Communications, 2023,14, 7027

2023
[28]

C. Zeni, R. Pinsler, D. Zügner, A. Fowler, M. Horton, X. Fu, Z. Wang, A. Shysheya, J. Crabbé, S. Ueda, R. Sordillo, L. Sun, J. Smith, B. Nguyen, H. Schulz, S. Lewis, C.-W. Huang, Z. Lu, Y. Zhou, H. Yang, H. Hao, J. Li, C. Yang, W. Li, R. Tomioka and T. Xie, Nature, 2025,639, 624–632

2025
[29]

L. Ward, A. Agrawal, A. Choudhary and C. Wolver- ton,npj Computational Materials, 2016,2, 16028

2016
[30]

A. G. Kusne, H. Yu, C. Wu, H. Zhang, J. Hattrick- Simpers, B. DeCost, S. Sarker, C. Oses, C. Toher, S. Curtarolo, A. V. Davydov, R. Agarwal, L. A. Ben- dersky, M.Li, A.MehtaandI.Takeuchi,Nature Com- munications, 2020,11, 5966

2020
[31]

N. J. Szymanski, B. Rendy, Y. Fei, R. E. Kumar, T. He, D. Milsted, M. J. McDermott, M. Gallant, E. D. Cubuk, A. Merchant, H. Kim, A. Jain, C. J. Bartel, K. Persson, Y. Zeng and G. Ceder,Nature, 2023,624, 86–91

2023
[32]

E. A. Pogue, A. New, K. McElroy, N. Q. Le, M. J. Pekala, I. McCue, E. Gienger, J. Domenico, E. Hedrick, T. M. McQueen, B. Wilfong, C. D. Pi- atko, C. R. Ratto, A. Lennon, C. Chung, T. Montal- bano, G. Bassen and C. D. Stiles,npj Computational Materials, 2023,9, 181

2023
[33]

Palizhati, S

A. Palizhati, S. B. Torrisi, M. Aykol, S. K. Suram, J. S. Hummelshøj and J. H. Montoya,Scientific Re- ports, 2022,12, 4694

2022
[34]

D. Xue, D. Xue, R. Yuan, Y. Zhou, P. V. Balachan- dran, X. Ding, J. Sun and T. Lookman,Acta Materi- alia, 2017,125, 532–541

2017
[35]

D. Xue, P. V. Balachandran, J. Hogden, J. Theiler, D. Xue and T. Lookman,Nature Communications, 2016,7, 11241

2016
[36]

D.Xue, P.V.Balachandran, R.Yuan, T.Hu, X.Qian, E. R. Dougherty and T. Lookman,Proceedings of the National Academy of Sciences, 2016,113, 13301– 13306

2016
[37]

H. Wang, R. E. Castañeda, J. R. Werber, Y. Fehlis, E. Kim and J. Hattrick-Simpers,Training-Free Ac- tive Learning Framework in Materials Science with Large Language Models, 2025,https://arxiv.org/ abs/2511.19730

work page arXiv 2025
[38]

R. Xin, E. M. D. Siriwardane, Y. Song, Y. Zhao, S.-Y. Louis, A. Nasiri and J. Hu,The Journal of Physical Chemistry C, 2021,125, 16118–16128

2021
[39]

Rebuffi, S

L. Rebuffi, S. Kandel, X. Shi, R. Zhang, R. J. Harder, W. Cha, M. J. Highland, M. G. Frith, L. Assoufid and M. J. Cherukara,Opt. Express, 2023,31, 39514– 39527

2023
[40]

K. Li, D. Persaud, K. Choudhary, B. DeCost, M. Greenwood and J. Hattrick-Simpers,Nature Com- munications, 2023,14, 7283

2023
[41]

Chanussot, A

L. Chanussot, A. Das, S. Goyal, T. Lavril, M. Shuaibi, M. Riviere, K. Tran, J. Heras-Domingo, C. Ho, W. Hu, A. Palizhati, A. Sriram, B. Wood, J. Yoon, D. Parikh, C. L. Zitnick and Z. Ulissi,ACS Catalysis, 2021,11, 13062–13065

2021
[42]

R. Tran, J. Lan, M. Shuaibi, B. M. Wood, S. Goyal, A. Das, J. Heras-Domingo, A. Kolluru, A. Rizvi, N. Shoghi, A. Sriram, F. Therrien, J. Abed, O. Voznyy, E. H. Sargent, Z. Ulissi and C. L. Zitnick, ACS Catalysis, 2023,13, 3066–3084

2023
[43]

L. Ward, R. Liu, A. Krishna, V. I. Hegde, A. Agrawal, A. Choudhary and C. Wolverton,Phys. Rev. B, 2017, 96, 024104

2017
[44]

L. Ward, A. Dunn, A. Faghaninia, N. E. Zimmer- mann, S. Bajaj, Q. Wang, J. Montoya, J. Chen, K. Bystrom, M. Dylla, K. Chard, M. Asta, K. A. Persson, G. J. Snyder, I. Foster and A. Jain,Compu- tational Materials Science, 2018,152, 60–69

2018
[45]

L. Tang, L. Purdy, T. Mohanty, L. Ng and T. Sparks, Systematically Verified Experimental Thermoelectric Dataset For Data-driven Approaches, 2025

2025
[46]

Grinsztajn, E

L. Grinsztajn, E. Oyallon and G. Varoquaux, Pro- ceedings of the 36th International Conference on Neu- ral Information Processing Systems, Red Hook, NY, USA, 2022

2022
[47]

K. Deb, A. Pratap, S. Agarwal and T. Meyarivan, IEEE Transactions on Evolutionary Computation, 2002,6, 182–197

2002
[48]

Eriksson, M

D. Eriksson, M. Pearce, J. R. Gardner, R. Turner and M. Poloczek,Scalable Global Optimization via Local Bayesian Optimization, 2020,https://arxiv.org/ abs/1910.01739. 17

work page arXiv 2020
[49]

N. Maus, K. Wu, D. Eriksson and J. Gardner, Discovering Many Diverse Solutions with Bayesian Optimization, 2023,https://arxiv.org/abs/2210. 10953

2023
[50]

H. P. Vanchinathan, A. Marfurt, C.-A. Robelin, D. Kossmann and A. Krause, Proceedings of the 21th ACM SIGKDD International Conference on Knowl- edge Discovery and Data Mining, New York, NY, USA, 2015, p. 1195–1204

2015
[51]

Nguyen and A

Q. Nguyen and A. B. Dieng,Quality-Weighted Vendi Scores And Their Application To Diverse Experimen- tal Design, 2024,https://arxiv.org/abs/2405. 02449

2024
[52]

Malkomes, B

G. Malkomes, B. Cheng, E. H. Lee and M. Mccourt, Proceedings of the 38th International Conference on Machine Learning, 2021, pp. 7423–7434

2021
[53]

E. Nava, M. Mutný and A. Krause,Diversified Sam- pling for Batched Bayesian Optimization with De- terminantal Point Processes, 2022,https://arxiv. org/abs/2110.11665

work page arXiv 2022
[54]

C. Gong, J. Peng and Q. Liu, Proceedings of the 36th International Conference on Machine Learning, 2019, pp. 2347–2356

2019
[55]

E. Nava, M. Mutny and A. Krause, Proceedings of The 25th International Conference on Artificial Intel- ligence and Statistics, 2022, pp. 7031–7054

2022
[56]

Kathuria, A

T. Kathuria, A. Deshpande and P. Kohli, Proceedings of the 30th International Conference on Neural In- formation Processing Systems, Red Hook, NY, USA, 2016, p. 4213–4221

2016
[57]

Shibukawa, S

R. Shibukawa, S. Matsuda, K. Nakamura, R. Tamura and K. Tsuda,npj Computational Materials, 2026

2026
[58]

Couperthwaite, A

R. Couperthwaite, A. Molkeri, D. Khatamsaz, A. Sri- vastava, D. Allaire and R. Arròyave,JOM, 2020,72, 4431–4443

2020
[59]

Wilson, D

N. Wilson, D. Willhelm, X. Qian, R. Arróyave and X. Qian,Computational Materials Science, 2022, 208, 111330

2022
[60]

Hastings, M

T. Hastings, M. Mulukutla, D. Khatamsaz, D. Salas, W. Xu, D. Lewis, N. Person, M. Skokan, B. Miller, J. Paramore, B. Butler, D. Allaire, V. Attari, I. Kara- man, G. Pharr, A. Srivastava and R. Arróyave,Acta Materialia, 2025,297, 121173

2025
[61]

S. M. A. A. Alvi, B. Vela, V. Attari, J. Janssen, D. Perez, D. Allaire and R. Arróyave,npj Compu- tational Materials, 2026,12, 105

2026
[62]

Wei and E.-c

L.-s. Wei and E.-c. Li,Journal of Computational De- sign and Engineering, 2023,10, 1988–2018

2023
[63]

M. Li, S. Yang and X. Liu,Artificial Intelligence, 2015,228, 45–65

2015
[64]

The vendi score: A diversity evaluation metric for machine learning.arXiv preprint arXiv:2210.02410, 2022

D. Friedman and A. B. Dieng,The Vendi Score: A Diversity Evaluation Metric for Machine Learning, 2023,https://arxiv.org/abs/2210.02410

work page arXiv 2023
[65]

K. S. Beyer, J. Goldstein, R. Ramakrishnan and U. Shaft, Proceedings of the 7th International Con- ference on Database Theory, Berlin, Heidelberg, 1999, p. 217–235

1999
[66]

C. C. Aggarwal, A. Hinneburg and D. A. Keim, Proceedings of the 8th International Conference on Database Theory, Berlin, Heidelberg, 2001, p. 420–434

2001
[67]

Radovanović, A

M. Radovanović, A. Nanopoulos and M. Ivanović, Proceedings of the 26th Annual International Con- ference on Machine Learning, New York, NY, USA, 2009, p. 865–872

2009
[68]

B. T. Mamillapalli, M. R. Kochi and S. M. Moosavi, AI for Accelerated Materials Design - ICLR 2026, 2026. 18 A Appendix Figure 10:Pearson correlation of outcomes variables of DFT datasets. Figure 11:Pearson correlation of outcomes variables of the sysTEm experimental dataset. 19 Figure 12:Outcomes distribution shown in the first two PCs of feature space. ...

2026