Regression with Large Language Models for Materials and Molecular Property Prediction
Pith reviewed 2026-05-23 21:08 UTC · model grok-4.3
The pith
LLaMA 3 fine-tuned on SMILES strings rivals random forests for QM9 molecular property regression.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
LLaMA 3, when fine-tuned using the SMILES representation of molecules and only the generative loss, provides useful regression results which can rival standard materials property prediction models like random forest or fully connected neural networks on the QM9 dataset. On 28 materials properties it supplies comparable though slightly worse accuracy relative to random forest and elemental descriptors when given only compound chemical descriptions. Errors remain 5-10 times higher than those of state-of-the-art models that receive atom types and coordinates.
What carries the argument
Fine-tuning LLaMA 3 on the generative loss with SMILES strings or chemical composition descriptions as sole inputs for property-value regression.
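A minimal sketch of how one labeled molecule might be serialized into text for this kind of fine-tuning. The prompt template, the `make_record` helper, the property name, and the number of decimals are all assumptions made for illustration; the paper's actual template is not reproduced here. The point is only that the target value is written out as ordinary text, so the standard next-token loss applies without a regression head.

```python
# Hypothetical serialization of (SMILES, property value) pairs into
# prompt/completion text for generative fine-tuning. The template and
# field names are illustrative assumptions, not the paper's format.

def make_record(smiles: str, prop_name: str, value: float, decimals: int = 4) -> dict:
    """Turn one labeled molecule into a text-only training example."""
    prompt = f"SMILES: {smiles}\nProperty: {prop_name}\nValue:"
    # The target number is written as plain text, so the model learns it
    # token by token under the ordinary next-token loss.
    completion = f" {value:.{decimals}f}"
    return {"prompt": prompt, "completion": completion}


records = [
    make_record("C", "homo_lumo_gap_eV", 1.0),    # placeholder value, not QM9 data
    make_record("CCO", "homo_lumo_gap_eV", 2.5),  # placeholder value, not QM9 data
]
```

Fixing the number of decimals keeps the target strings uniform in length, which is one plausible design choice among several; scientific notation would be another.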
If this is right
- Language models can serve as drop-in regressors for physical properties when supplied with string representations alone.
- Standard generative training objectives can implicitly encode quantitative structure-property relationships.
- Materials and molecular datasets expressed as SMILES or composition strings become directly usable for LLM-based prediction.
- Fine-tuned LLaMA 3 gives better predictions on these tasks than GPT-3.5 and GPT-4o.
Where Pith is reading between the lines
- The same string-only fine-tuning recipe could be tested on other string-representable scientific domains such as protein sequences or crystal prototypes.
- Hybrid pipelines that feed LLM outputs as priors into atomistic models might reduce the accuracy gap to coordinate-based methods.
- If next-token prediction already captures numerical trends, further gains may come from larger context windows rather than new loss functions.
Load-bearing premise
Fine-tuning solely on the generative loss with only composition-based string inputs is sufficient to learn accurate numerical property predictions without requiring more granular structural representations or regression-specific objectives.
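Because no regression head is involved, a prediction exists only as generated text and has to be read back into a number. The authors' rebuttal describes regex-based parsing with handling for scientific notation; the sketch below shows one way that could look. The pattern and the `parse_prediction` helper are assumptions, not the authors' code, and units are simply ignored here rather than converted.

```python
import re

# Optionally signed decimal with an optional exponent, e.g. "-1.2e-3".
_NUMBER = re.compile(r"[-+]?(?:\d+\.?\d*|\.\d+)(?:[eE][-+]?\d+)?")


def parse_prediction(text: str):
    """Return the first number found in generated text, or None if absent."""
    match = _NUMBER.search(text)
    return float(match.group()) if match else None


examples = [" -0.2345 eV", "Value: 1.5e-3 Ha", "no number here"]
parsed = [parse_prediction(s) for s in examples]
```

Returning `None` instead of a default value keeps unparseable generations visible, so they can be counted and reported rather than silently folded into the error metric. The pattern should be applied to the completion only, since digits in the prompt (for example in a formula) would otherwise be matched first.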
What would settle it
A direct comparison on the QM9 dataset in which LLaMA 3 errors stay more than five times larger than those of a random forest baseline, even when the paper's fine-tuning protocol is reproduced exactly.
Original abstract
We demonstrate the ability of large language models (LLMs) to perform material and molecular property regression tasks, a significant deviation from the conventional LLM use case. We benchmark the Large Language Model Meta AI (LLaMA) 3 on several molecular properties in the QM9 dataset and 28 materials properties. Only composition-based input strings are used as the model input and we fine tune on only the generative loss. We broadly find that LLaMA 3, when fine-tuned using the SMILES representation of molecules, provides useful regression results which can rival standard materials property prediction models like random forest or fully connected neural networks on the QM9 dataset. Not surprisingly, LLaMA 3 errors are 5-10x higher than those of the state-of-the-art models that were trained using far more granular representation of molecules (e.g., atom types and their coordinates) for the same task. Similarly, LLaMA 3 provides comparable, although slightly worse, accuracy relative to random forest and elemental descriptors when using just compound chemical description on our set of 28 materials properties. Interestingly, LLaMA 3 provides improved predictions compared to GPT-3.5 and GPT-4o. This work highlights the versatility of LLMs, suggesting that LLM-like generative models can potentially transcend their traditional applications to tackle complex physical phenomena, thus paving the way for future research and applications in chemistry, materials science and other scientific domains.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript claims that LLaMA 3, fine-tuned exclusively on the generative (next-token prediction) loss using only composition-based string inputs (SMILES for molecules, chemical descriptions for materials), can perform regression for molecular properties on the QM9 dataset and 28 materials properties. It reports that the resulting accuracies rival those of random forest and fully connected neural networks on QM9 while remaining 5-10x worse than coordinate-based SOTA models, and that LLaMA 3 outperforms GPT-3.5 and GPT-4o on the materials tasks.
Significance. If the empirical benchmarks hold under rigorous verification, the work would demonstrate that standard generative fine-tuning of an LLM can yield usable numerical property predictions from minimal string inputs, thereby extending LLM applicability to regression tasks in chemistry and materials science without custom regression heads or structural coordinates. This would be a concrete, falsifiable illustration of the versatility of generative models for physical-property tasks.
major comments (3)
- [Methods] Methods (fine-tuning procedure): The central claim that 'fine tune on only the generative loss' produces accurate regression rests on an unstated assumption that next-token prediction on formatted target strings will implicitly minimize numerical error. No details are given on output parsing (e.g., regex extraction, handling of units or scientific notation), temperature during inference, or whether any auxiliary regression loss or value-head was used; without these, it is impossible to rule out that reported performance arises from mode collapse to mean values or post-hoc averaging rather than genuine property learning.
- [Results] Results (QM9 benchmarks): The claim that LLaMA 3 'rivals' random forest and FCNN is load-bearing for the abstract's main conclusion, yet the manuscript provides no information on the precise train/test splits, whether the baselines used identical splits or the same SMILES-derived features, or any statistical comparison (error bars, p-values). This omission directly affects whether the reported parity is reproducible or an artifact of experimental setup.
- [Materials properties] Materials properties section: The input representation is described only as 'compound chemical description.' Without an explicit example or enumeration of the string format (element counts, stoichiometry notation, etc.), it is unclear how the input granularity compares to the 'elemental descriptors' used by the random-forest baseline, undermining the direct comparability asserted in the abstract.
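The mode-collapse concern in the first major comment can be checked with a simple diagnostic: compare the model's error against a constant predictor that always returns the training mean, and compare the spread of the predictions with the spread of the targets. The sketch below is one hypothetical way to do this; the function names and the toy numbers are invented for illustration and are not results from the paper.

```python
import statistics


def mae(y_true, y_pred):
    """Mean absolute error between two equal-length sequences."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def collapse_report(y_train, y_test, y_pred):
    """Compare model error and spread against a constant mean predictor."""
    mean_value = statistics.fmean(y_train)
    baseline_mae = mae(y_test, [mean_value] * len(y_test))
    return {
        "model_mae": mae(y_test, y_pred),
        "baseline_mae": baseline_mae,
        # A prediction spread near zero next to a wide target spread
        # suggests the model has collapsed toward a single value.
        "pred_std": statistics.pstdev(y_pred),
        "target_std": statistics.pstdev(y_test),
    }


# Toy numbers for illustration only.
report = collapse_report(
    y_train=[1.0, 2.0, 3.0],
    y_test=[1.0, 3.0],
    y_pred=[1.5, 2.5],
)
```

A model MAE close to the baseline MAE, together with a small `pred_std`, would support the referee's worry; a clearly lower model MAE with a comparable spread would argue against it.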
minor comments (2)
- [Abstract] The abstract states 'we broadly find' without quantifying the number of properties or runs; a concise table summarizing all QM9 targets and the 28 materials properties would improve clarity.
- [Figures] Figure captions and axis labels should explicitly state the error metric (MAE, RMSE, etc.) and units for each property to allow immediate comparison with literature values.
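The two metrics named in the last comment weight large errors differently, which is why a caption needs to say which one is plotted. The sketch below computes both and summarizes them across repeated runs, in the spirit of the error bars requested in the major comments. The helper names and the toy data are assumptions for illustration only.

```python
import math
import statistics


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error; penalizes large errors more than MAE."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def summarize_runs(y_true, runs):
    """Mean and sample standard deviation of each metric over runs."""
    maes = [mae(y_true, r) for r in runs]
    rmses = [rmse(y_true, r) for r in runs]
    return {
        "mae_mean": statistics.fmean(maes),
        "mae_std": statistics.stdev(maes),    # requires at least two runs
        "rmse_mean": statistics.fmean(rmses),
        "rmse_std": statistics.stdev(rmses),
    }


# Toy data: two runs predicting the same two targets.
y_true = [0.0, 0.0]
runs = [[0.0, 2.0], [2.0, 2.0]]
summary = summarize_runs(y_true, runs)
```

In the first toy run a single large error makes RMSE exceed MAE, while in the second run the errors are uniform and the two metrics coincide.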
Simulated Author's Rebuttal
We thank the referee for their careful reading and constructive comments, which have helped us improve the clarity and reproducibility of the manuscript. We address each major comment below and have revised the paper accordingly to incorporate additional methodological details, experimental specifications, and input examples.
Point-by-point responses
- Referee: [Methods] Methods (fine-tuning procedure): The central claim that 'fine tune on only the generative loss' produces accurate regression rests on an unstated assumption that next-token prediction on formatted target strings will implicitly minimize numerical error. No details are given on output parsing (e.g., regex extraction, handling of units or scientific notation), temperature during inference, or whether any auxiliary regression loss or value-head was used; without these, it is impossible to rule out that reported performance arises from mode collapse to mean values or post-hoc averaging rather than genuine property learning.
  Authors: We agree that the original Methods section lacked sufficient detail on these aspects. In the revised manuscript, we have added a new subsection detailing the fine-tuning and inference procedure. This includes: the use of regex-based parsing to extract numerical values from generated text (with explicit handling for scientific notation and units), inference performed at temperature 0.0 for deterministic outputs, and explicit confirmation that only the standard generative next-token prediction loss was used with no auxiliary regression loss or value head. These additions clarify that the reported performance stems from the fine-tuned model's learned associations rather than post-processing artifacts.
  Revision: yes
- Referee: [Results] Results (QM9 benchmarks): The claim that LLaMA 3 'rivals' random forest and FCNN is load-bearing for the abstract's main conclusion, yet the manuscript provides no information on the precise train/test splits, whether the baselines used identical splits or the same SMILES-derived features, or any statistical comparison (error bars, p-values). This omission directly affects whether the reported parity is reproducible or an artifact of experimental setup.
  Authors: We acknowledge this omission and have revised the Results section to specify that the QM9 experiments employed the standard train/test splits from the original QM9 dataset publication. The random forest and FCNN baselines were re-implemented using identical splits and SMILES-derived features for direct comparability. We now include error bars computed over multiple independent runs and note the absence of statistically significant differences where the performances are comparable, supporting the reproducibility of the parity claim.
  Revision: yes
- Referee: [Materials properties] Materials properties section: The input representation is described only as 'compound chemical description.' Without an explicit example or enumeration of the string format (element counts, stoichiometry notation, etc.), it is unclear how the input granularity compares to the 'elemental descriptors' used by the random-forest baseline, undermining the direct comparability asserted in the abstract.
  Authors: We have revised the Materials properties section to include explicit examples of the compound chemical description strings (e.g., formats incorporating element counts and stoichiometry such as 'Compound with 3 atoms of Fe and 4 atoms of O'). This addition enables readers to directly compare the input granularity to the elemental descriptors in the random forest baseline and supports the comparability asserted in the abstract.
  Revision: yes
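The example string quoted in the last response can be produced mechanically from a flat chemical formula. The sketch below is a hypothetical illustration of that mapping, not the authors' code; the `describe_composition` helper is invented here, and it assumes integer stoichiometry with no parentheses, hydrates, or fractional occupancies.

```python
import re

# An element symbol (capital letter, optional lowercase letter)
# followed by an optional integer count.
_TOKEN = re.compile(r"([A-Z][a-z]?)(\d*)")


def describe_composition(formula: str) -> str:
    """Render a flat formula such as 'Fe3O4' as a plain-English description."""
    parts = []
    for element, count in _TOKEN.findall(formula):
        n = int(count) if count else 1  # a missing count means one atom
        noun = "atom" if n == 1 else "atoms"
        parts.append(f"{n} {noun} of {element}")
    return "Compound with " + " and ".join(parts)


description = describe_composition("Fe3O4")
```

Such a string carries only element identities and counts, which is the sense in which the input is less granular than hand-built elemental descriptors or atomic coordinates.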
Circularity Check
No circularity: purely empirical benchmarking against external baselines.
full rationale
The paper reports fine-tuning LLaMA 3 on generative loss using SMILES/composition strings, then measures regression performance on QM9 and 28 materials properties via direct comparison to independent models (random forest, FCNN, GPT variants). No derivation chain, fitted parameters renamed as predictions, self-citations, or ansatzes exist; all claims rest on held-out test metrics and external SOTA references. The central result is falsifiable by re-running the benchmarks and does not reduce to its own inputs.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Composition-based input strings contain sufficient information to support useful property regression when the model is fine-tuned on generative loss.
Forward citations
Cited by 2 Pith papers
- Text-to-Distribution Prediction with Quantile Tokens and Neighbor Context
  Quantile tokens inserted into LLM inputs combined with neighbor retrieval enable direct prediction of full distributions, yielding lower MAPE and narrower intervals than baselines on Airbnb and StackSample tasks.
- Scale-Dependent Input Representation and Confidence Estimation for LLMs in Materials Property Prediction
  Larger LLMs handle detailed crystal descriptions better than small ones, and mean negative log-likelihood of predicted numbers tracks prediction error after fine-tuning.