Alchemy: A Quantum Chemistry Dataset for Benchmarking AI Models
Pith reviewed 2026-05-25 18:25 UTC · model grok-4.3
The pith
Alchemy supplies quantum properties for 119487 molecules to benchmark graph neural networks in chemistry.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Alchemy comprises 12 quantum mechanical properties of 119487 organic molecules with up to 14 heavy atoms sampled from the GDB MedChem database. Extensive benchmarks of the state-of-the-art graph neural network models on Alchemy clearly manifest the usefulness of new data in validating and developing machine learning models for chemistry and material science.
What carries the argument
The Alchemy dataset of molecular quantum properties used to benchmark graph neural networks.
If this is right
- Graph neural network models can be validated at larger scale and with greater molecular diversity than before.
- Machine learning models for predicting quantum properties gain a new resource for training and testing.
- The launched contest draws additional researchers to develop models using the dataset.
- Further molecules generated after the initial 119487 samples increase the available resource for ongoing work.
Where Pith is reading between the lines
- Alchemy could become a standard benchmark that replaces or augments smaller datasets such as QM9 for routine model evaluation.
- Architectures whose performance improves markedly with the added volume may be preferred for scaling to larger chemical systems.
- The sampling from GDB MedChem may allow models trained on Alchemy to transfer more readily to medicinal chemistry applications than models trained only on narrower sets.
Load-bearing premise
The molecules sampled from the GDB MedChem database supply sufficient additional diversity and relevance beyond existing smaller datasets to meaningfully advance model development.
What would settle it
Benchmarks in which the same graph neural network models exhibit identical relative performance and no new validation insights on Alchemy versus prior smaller datasets would falsify the usefulness claim.
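One way to make "identical relative performance" concrete is to compare how the same models rank on each dataset. The sketch below is an illustration, not a procedure from the paper: it computes a Spearman rank correlation between two sets of per-model errors, assuming no ties. The model names and error values are hypothetical placeholders, not reported results.

```python
from typing import Dict


def ranks(scores: Dict[str, float]) -> Dict[str, int]:
    """Rank models by error, 1 = lowest error. Assumes no tied scores."""
    ordered = sorted(scores, key=lambda name: scores[name])
    return {name: position + 1 for position, name in enumerate(ordered)}


def spearman(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Spearman rank correlation between two tie-free score tables."""
    if set(a) != set(b):
        raise ValueError("both tables must cover the same models")
    n = len(a)
    if n < 2:
        raise ValueError("need at least two models to compare rankings")
    rank_a, rank_b = ranks(a), ranks(b)
    squared_diff = sum((rank_a[m] - rank_b[m]) ** 2 for m in a)
    return 1.0 - 6.0 * squared_diff / (n * (n * n - 1))


# Hypothetical placeholder errors (lower is better); NOT values from the paper.
mae_prior_dataset = {"model_a": 0.30, "model_b": 0.20, "model_c": 0.10}
mae_alchemy = {"model_a": 0.25, "model_b": 0.35, "model_c": 0.15}

rho = spearman(mae_prior_dataset, mae_alchemy)
print(f"rank correlation across datasets: {rho:.2f}")
```

A correlation of 1.0 would mean the model ordering is unchanged between datasets, which is the outcome that would count against the usefulness claim; lower values would indicate that the new data reorders the models.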
Original abstract
We introduce a new molecular dataset, named Alchemy, for developing machine learning models useful in chemistry and material science. As of June 20th 2019, the dataset comprises of 12 quantum mechanical properties of 119,487 organic molecules with up to 14 heavy atoms, sampled from the GDB MedChem database. The Alchemy dataset expands the volume and diversity of existing molecular datasets. Our extensive benchmarks of the state-of-the-art graph neural network models on Alchemy clearly manifest the usefulness of new data in validating and developing machine learning models for chemistry and material science. We further launch a contest to attract attentions from researchers in the related fields. More details can be found on the contest website (https://alchemy.tencent.com). At the time of benchamrking experiment, we have generated 119,487 molecules in our Alchemy dataset. More molecular samples are generated since then. Hence, we provide a list of molecules used in the reported benchmarks.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces the Alchemy dataset of 119,487 organic molecules (up to 14 heavy atoms) sampled from GDB MedChem, each annotated with 12 quantum mechanical properties. It reports benchmarks of state-of-the-art graph neural network models on this dataset, asserts that the results demonstrate the dataset's usefulness for validating and developing ML models in chemistry, and launches an associated contest.
Significance. If the expanded scale and diversity from GDB MedChem introduce new modeling challenges absent from smaller prior sets, the dataset and benchmarks could support incremental progress in graph-based models for quantum chemistry. The explicit release of the exact molecule list used for benchmarking is a positive reproducibility feature.
major comments (2)
- [Abstract] The central claim that 'our extensive benchmarks of the state-of-the-art graph neural network models on Alchemy clearly manifest the usefulness of new data' is unsupported because no quantitative comparison (property distribution shifts, scaffold novelty, or cross-dataset transfer performance) to existing datasets such as QM9 is provided.
- [Abstract] The benchmarking description states dataset size and property count but supplies no error bars, run-to-run variance, or exclusion criteria for the 119,487 molecules, which undermines assessment of the reliability of the GNN results.
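As an illustration of the kind of quantitative comparison the first comment asks for, the sketch below measures the shift between two datasets in one simple descriptor, the heavy-atom count per molecule, using total variation distance. The choice of descriptor and metric is an assumption made for this example, not something the paper or the referee prescribes, and the two lists are toy placeholders rather than real dataset statistics.

```python
from collections import Counter
from typing import Iterable


def total_variation(xs: Iterable[int], ys: Iterable[int]) -> float:
    """Total variation distance between two empirical discrete distributions.

    Returns 0.0 for identical distributions and 1.0 for disjoint ones.
    """
    count_x, count_y = Counter(xs), Counter(ys)
    n_x, n_y = sum(count_x.values()), sum(count_y.values())
    if n_x == 0 or n_y == 0:
        raise ValueError("both samples must be non-empty")
    support = set(count_x) | set(count_y)
    return 0.5 * sum(abs(count_x[k] / n_x - count_y[k] / n_y) for k in support)


# Toy heavy-atom counts per molecule; hypothetical, NOT drawn from QM9 or Alchemy.
reference_heavy_atoms = [7, 8, 9, 9]
new_heavy_atoms = [9, 10, 12, 14]

shift = total_variation(reference_heavy_atoms, new_heavy_atoms)
print(f"total variation distance in heavy-atom count: {shift:.2f}")
```

The same function could be applied to any discretized property; a fuller comparison would also need scaffold-level and transfer-performance measures, which this sketch does not attempt.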
minor comments (2)
- [Abstract] Typo: 'benchamrking' should be 'benchmarking'.
- [Abstract] Grammatical: 'attract attentions from researchers' should be 'attract attention from researchers'.
Simulated Author's Rebuttal
We thank the referee for the constructive comments. We address each major comment below and will revise the manuscript accordingly where appropriate.
- Referee: [Abstract] The central claim that 'our extensive benchmarks of the state-of-the-art graph neural network models on Alchemy clearly manifest the usefulness of new data' is unsupported because no quantitative comparison (property distribution shifts, scaffold novelty, or cross-dataset transfer performance) to existing datasets such as QM9 is provided.
  Authors: We acknowledge that the abstract asserts the usefulness of the new data without providing explicit quantitative comparisons to QM9 (such as property distribution shifts or scaffold novelty). The manuscript text notes that Alchemy expands volume and diversity by sampling from GDB MedChem and supplies the exact list of benchmarked molecules, but we agree this does not substitute for direct comparative metrics. We will revise the abstract to moderate the claim and add a new subsection with quantitative comparisons to QM9.
  Revision: yes
- Referee: [Abstract] The benchmarking description states dataset size and property count but supplies no error bars, run-to-run variance, or exclusion criteria for the 119,487 molecules, which undermines assessment of the reliability of the GNN results.
  Authors: The manuscript already states that the exact list of 119,487 molecules used for the reported benchmarks is released to support reproducibility, which addresses exclusion criteria. However, we agree that the absence of error bars and run-to-run variance in the benchmarking results limits evaluation of reliability. We will revise the benchmarking section to report these statistics (e.g., standard deviations across multiple runs) and clarify any additional filtering steps applied.
  Revision: yes
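The statistics promised in the second response can be produced with very little machinery. The following is a minimal sketch, assuming each training run (for example, with a different random seed) yields one test error for a given model and property; the three error values are hypothetical placeholders, not results from the paper.

```python
import statistics
from typing import Sequence, Tuple


def summarize_runs(errors: Sequence[float]) -> Tuple[float, float]:
    """Return (mean, sample standard deviation) of per-run test errors."""
    if len(errors) < 2:
        raise ValueError("need at least two runs to estimate run-to-run variance")
    return statistics.mean(errors), statistics.stdev(errors)


# Hypothetical test MAEs from three independent runs; NOT values from the paper.
run_maes = [0.10, 0.12, 0.11]

mean_mae, std_mae = summarize_runs(run_maes)
print(f"MAE = {mean_mae:.3f} +/- {std_mae:.3f} over {len(run_maes)} runs")
```

Reporting the number of runs alongside the mean and standard deviation lets readers judge whether differences between models exceed run-to-run noise.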
Circularity Check
No circularity: dataset release and empirical benchmarks only
The paper is a data release (Alchemy sampled from GDB MedChem, 119k molecules, 12 QM properties) plus standard GNN benchmarking. No derivation chain, fitted parameters renamed as predictions, self-citation load-bearing on a uniqueness theorem, or ansatz smuggling exists. The central claim that benchmarks 'manifest usefulness' is an empirical assertion, not a reduction of any output to its own inputs by construction. No equations or self-referential steps are present.
Forward citations
Cited by 2 Pith papers
- Weisfeiler-Leman Is Incomplete on Simple Spectrum Graphs, so Canonicalize Them
  k-WL is incomplete on simple spectrum graphs; PRiSM is the first provably complete canonicalization for their eigendecompositions.
- Path-Based Gradient Boosting for Graph-Level Prediction
  PathBoost extends path-based gradient boosting with logistic loss, prefix-based multi-attribute handling, and automatic anchor selection, achieving better or comparable results to GNNs and graph kernels on benchmark d...