Recognition: unknown
Scale-Dependent Input Representation and Confidence Estimation for LLMs in Materials Property Prediction
Pith reviewed 2026-05-07 15:59 UTC · model grok-4.3
The pith
Fine-tuned LLMs predict crystal formation energies and bandgaps more accurately with space-group-inclusive summaries, and mean negative log-likelihood acts as a built-in confidence indicator after fine-tuning.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We fine-tune Llama models of 1B and 8B parameters on an inorganic crystal structure dataset and compare five input representations for formation energy and bandgap prediction. The optimal representation depends on model scale, with the 1B model favoring compact forms and the 8B model maintaining accuracy on longer descriptions and CIF files. Crystal summaries that include space-group information consistently outperform composition-only inputs at both scales. Fine-tuned models show a clear trend in which lower mean negative log-likelihood of the numerical output tokens corresponds to smaller prediction errors, indicating that mean NLL can serve as a practical confidence measure without requiring additional training.
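As a rough illustration of the confidence signal, the sketch below computes the mean NLL of the generated numeric tokens with a Hugging Face causal LM. The checkpoint name, prompt wording, and the simple prompt/answer tokenization split are placeholders for illustration, not the paper's actual pipeline.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical fine-tuned checkpoint; the paper's exact model and prompt
# format are not reproduced here.
MODEL = "meta-llama/Llama-3.2-1B"

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)
model.eval()

def mean_nll_of_answer(prompt: str, answer: str) -> float:
    """Mean negative log-likelihood of the answer tokens given the prompt.

    Lower values would be read as higher model confidence in the
    predicted numerical value.
    """
    prompt_ids = tok(prompt, return_tensors="pt").input_ids
    full_ids = tok(prompt + answer, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits          # (1, seq_len, vocab)
    # Next-token log-probs: position i predicts token i+1.
    log_probs = torch.log_softmax(logits[:, :-1, :], dim=-1)
    targets = full_ids[:, 1:]
    token_ll = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    # Keep only the tokens belonging to the numeric answer.
    n_answer = full_ids.shape[1] - prompt_ids.shape[1]
    answer_ll = token_ll[:, -n_answer:]
    return float(-answer_ll.mean())

# Example: a formation-energy prediction rendered as text.
nll = mean_nll_of_answer(
    "Composition: NaCl. Predict the formation energy (eV/atom): ", "-2.11"
)
print(f"mean NLL = {nll:.3f}")
```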
What carries the argument
Comparison of five input representations (chemical composition, crystal summary with space group, local environment description, full text description, and CIF) across two model scales, together with mean negative log-likelihood of the numerical output tokens as a post-fine-tuning confidence signal.
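For concreteness, the snippet below shows what the five representations might look like for a single rock-salt NaCl entry. The strings are illustrative stand-ins; the paper's exact templates, field choices, and Robocrystallographer-style full-text output are not reproduced here.

```python
# Hypothetical prompt payloads for one structure (rock-salt NaCl).
representations = {
    # 1. Chemical composition only
    "composition": "NaCl",
    # 2. Crystal summary including space-group information
    "crystal_summary": "NaCl, cubic, space group Fm-3m (No. 225), 2 sites per cell",
    # 3. Local environment description
    "local_environment": "Na is octahedrally coordinated by 6 Cl; "
                         "Cl is octahedrally coordinated by 6 Na.",
    # 4. Full text description (Robocrystallographer-style prose)
    "full_text": "NaCl crystallizes in the cubic Fm-3m structure. Na(1) is "
                 "bonded to six equivalent Cl(1) atoms to form corner-sharing "
                 "NaCl6 octahedra ...",
    # 5. Raw crystallographic information file
    "cif": "data_NaCl\n_symmetry_space_group_name_H-M 'Fm-3m'\n"
           "_cell_length_a 5.64\n...",
}

prompt_template = "{rep}\nPredict the formation energy in eV/atom:"
prompts = {name: prompt_template.format(rep=text)
           for name, text in representations.items()}
```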
If this is right
- Input representation must be chosen according to model scale to reach best accuracy in crystal property tasks.
- Space-group symmetry supplies reliable structural features that improve predictions regardless of model size.
- Mean NLL of numerical tokens provides an efficient, training-free way to flag low-confidence outputs in materials LLM applications.
- Larger models can accept richer textual or file-based crystal descriptions while retaining predictive performance.
Where Pith is reading between the lines
- The scale-dependent behavior suggests that future materials LLMs may benefit from adaptive input pipelines that grow in detail with model capacity.
- Mean NLL confidence could be combined with existing uncertainty methods to prioritize which predictions warrant expensive follow-up calculations or experiments.
- Testing whether the same input and NLL patterns appear when predicting other properties such as elastic moduli or thermal conductivity would clarify how general the findings are.
Load-bearing premise
The observed link between lower mean NLL and smaller prediction error on the test set will continue to hold for crystal compositions and symmetries never seen during fine-tuning and for models trained on different data or with different random seeds.
What would settle it
Running the same fine-tuned models on a fresh collection of crystal structures whose compositions and space groups lie outside the original training distribution and checking whether the inverse relationship between mean NLL and actual error still appears.
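A minimal version of that check, assuming one already has per-structure mean NLL values and absolute errors on the fresh collection, could look like the sketch below; the Spearman correlation and retention fractions are illustrative choices rather than the paper's protocol.

```python
import numpy as np
from scipy.stats import spearmanr

def nll_error_diagnostics(mean_nll, abs_error):
    """Check whether lower mean NLL still tracks smaller errors on a new set.

    mean_nll, abs_error: 1-D arrays with one entry per test structure
    (hypothetical inputs produced by whatever evaluation loop is used).
    """
    mean_nll = np.asarray(mean_nll)
    abs_error = np.asarray(abs_error)

    # Rank correlation: a clearly positive value would indicate that the
    # inverse NLL-confidence relationship survives the distribution shift.
    rho, p = spearmanr(mean_nll, abs_error)

    # Selective-prediction view: mean error when keeping only the fraction
    # of predictions with the lowest mean NLL.
    order = np.argsort(mean_nll)
    retained_error = {
        frac: abs_error[order[: max(1, int(frac * len(order)))]].mean()
        for frac in (0.25, 0.5, 0.75, 1.0)
    }
    return rho, p, retained_error

# Example with synthetic numbers, purely illustrative.
rng = np.random.default_rng(0)
nll = rng.uniform(0.1, 2.0, size=200)
err = 0.05 + 0.1 * nll + rng.normal(0, 0.02, size=200)
print(nll_error_diagnostics(nll, err))
```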
Original abstract
Large language models (LLMs) are increasingly applied to materials science. However, the relationship between prediction accuracy, input representation, and model scale remains unclear, and reliable methods for assessing prediction confidence have not yet been established. In this study, we fine-tune two Llama models of different scales (1B and 8B) using low-rank adaptation (LoRA) on an inorganic crystal structure dataset. We systematically evaluate five input representations, namely chemical composition, crystal summary, local environment description, full text description, and crystallographic information files (CIF), for formation energy and bandgap prediction. Our results show that the optimal input representation depends on model scale. The 1B model performs better with compact representations, whereas the 8B model maintains high accuracy even with longer natural-language descriptions and CIF inputs. Across both model scales, crystal summaries that include space-group information consistently outperform composition-only inputs, indicating that symmetry information serves as a robust and informative feature. We further analyze the relationship between prediction error and the mean negative log-likelihood (mean NLL) of tokens corresponding to predicted numerical values. While no clear correlation is observed in base models, fine-tuned models exhibit a consistent trend in which lower mean NLL corresponds to smaller prediction errors. This result suggests that mean NLL can serve as a practical confidence indicator without requiring additional training. These findings demonstrate that both input representation and model scale play critical roles in LLM-based materials property prediction, and that mean NLL provides an effective and computationally efficient measure of prediction confidence.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper fine-tunes Llama-1B and Llama-8B models with LoRA on an inorganic crystal dataset to predict formation energy and bandgap. It systematically compares five input representations (composition-only, crystal summary, local environment, full text description, and CIF) and reports that optimal representation is scale-dependent: compact inputs suit the 1B model while the 8B model tolerates longer descriptions and CIF files. Crystal summaries that incorporate space-group information outperform composition-only inputs across both scales. In fine-tuned models, lower mean negative log-likelihood (NLL) of the numerical tokens in the output correlates with smaller prediction errors on the held-out test set, leading the authors to propose mean NLL as a training-free confidence indicator.
Significance. If the empirical trends hold, the work supplies practical guidance on input engineering for LLM-based materials property prediction and demonstrates a computationally cheap way to obtain per-prediction confidence scores. The consistent behavior across two model scales and multiple input formats strengthens the applicability of the findings to real materials-informatics workflows.
major comments (1)
- [Results on confidence estimation] Results section on mean NLL as confidence indicator: the reported correlation between mean NLL and absolute prediction error is shown exclusively on the in-distribution held-out test split. No out-of-distribution test sets (different elemental ranges, space-group families, or synthesis-condition shifts) or cross-setup experiments (varied random seeds, LoRA ranks, or training datasets) are presented. This leaves open whether mean NLL captures genuine epistemic uncertainty or merely dataset-specific token-probability artifacts, directly weakening the central claim that it constitutes a practical, general-purpose confidence indicator.
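One possible way to build the requested out-of-distribution evaluation is a leave-one-crystal-system-out split; the sketch below assumes a hypothetical pandas table with 'crystal_system' and 'formula' columns and is not tied to the paper's actual dataset schema.

```python
import pandas as pd

def leave_crystal_system_out(df: pd.DataFrame, held_out_system: str):
    """Split a structures table so one crystal system is entirely unseen."""
    test_mask = df["crystal_system"] == held_out_system
    train_df = df[~test_mask]
    test_df = df[test_mask]
    # Drop any training rows whose composition reappears in the held-out
    # set, so the shift is not leaked through duplicate formulas.
    train_df = train_df[~train_df["formula"].isin(test_df["formula"])]
    return train_df.reset_index(drop=True), test_df.reset_index(drop=True)

# Example usage (illustrative): hold out all hexagonal structures.
# train_df, test_df = leave_crystal_system_out(structures, "hexagonal")
```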
minor comments (2)
- [Abstract] Abstract: the phrase 'crystal summary' is used without a one-sentence definition of its content; a brief parenthetical description would improve immediate readability.
- [Methods] Methods: explicit reporting of train/validation/test split sizes, exact number of unique compositions, and the procedure used to guarantee the test set is strictly held-out would aid reproducibility.
Simulated Author's Rebuttal
We thank the referee for their detailed review and valuable suggestions. The feedback on the confidence estimation section is particularly helpful in refining our claims. We provide a point-by-point response below.
Point-by-point responses
Referee: Results section on mean NLL as confidence indicator: the reported correlation between mean NLL and absolute prediction error is shown exclusively on the in-distribution held-out test split. No out-of-distribution test sets (different elemental ranges, space-group families, or synthesis-condition shifts) or cross-setup experiments (varied random seeds, LoRA ranks, or training datasets) are presented. This leaves open whether mean NLL captures genuine epistemic uncertainty or merely dataset-specific token-probability artifacts, directly weakening the central claim that it constitutes a practical, general-purpose confidence indicator.
Authors: We appreciate the referee's point regarding the scope of our experiments on the mean NLL confidence indicator. It is correct that our analysis is performed on the held-out test split from the same dataset distribution. We did not include out-of-distribution evaluations or variations in training setups such as different seeds or LoRA ranks. This limitation means that while we observe a consistent correlation in fine-tuned models across scales and input types, we cannot yet claim that mean NLL fully captures epistemic uncertainty in all scenarios. In response, we will revise the manuscript to explicitly state that the proposed confidence indicator is validated within the in-distribution setting and to include a discussion of its potential limitations. We will also suggest that future work could explore OOD tests to further validate its generality. This revision clarifies the claims without overstating the current evidence. We believe this addresses the concern while preserving the practical utility demonstrated in our results.
Revision: partial
Circularity Check
No circularity: purely empirical evaluation on held-out data
full rationale
The paper reports experimental results from fine-tuning Llama models on a crystal dataset and measuring performance across input representations plus the correlation of mean NLL with error. No mathematical derivations, first-principles claims, or predictions are made that could reduce to fitted parameters or self-referential definitions by construction. All reported trends are direct computations from model outputs on held-out splits; the NLL metric requires no additional fitting. No load-bearing self-citations, uniqueness theorems, or ansatzes appear in the provided text, so the central claims remain independent of the inputs.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: LoRA fine-tuning on the chosen dataset is sufficient to adapt the base LLM to the regression task without catastrophic forgetting of general capabilities.
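A minimal sketch of that setup with the Hugging Face peft library is shown below; the base checkpoint, rank, and target modules are placeholders rather than the paper's reported hyperparameters.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Placeholder base model and LoRA hyperparameters for illustration only.
base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B")
lora_cfg = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()  # only the low-rank adapters are trained
```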
Reference graph
Works this paper leans on
- [1] T. B. Brown, B. Mann, N. Ryder et al., "Language models are few-shot learners," arXiv [cs.CL], 28 May 2020, arXiv:2005.14165.
- [2] A. Grattafiori, A. Dubey, A. Jauhri et al., "The Llama 3 herd of models," arXiv [cs.AI], 31 Jul. 2024, arXiv:2407.21783.
- [3] S. Jia, C. Zhang, and V. Fung, "LLMatDesign: Autonomous materials discovery with large language models," arXiv [cond-mat.mtrl-sci], 18 Jun. 2024, arXiv:2406.13163.
- [4] I. Takahara, T. Mizoguchi, and B. Liu, "Accelerated inorganic materials design with generative AI agents," Cell Rep. Phys. Sci., vol. 6, no. 12, p. 103019, 17 Dec. 2025.
- [5] V. Venugopal and E. Olivetti, "MatKG: An autonomously generated knowledge graph in material science," Sci. Data, vol. 11, no. 1, p. 217, 17 Feb. 2024.
- [6] P. Shetty, A. C. Rajan, C. Kuenneth et al., "A general-purpose material property data extraction pipeline from large polymer corpora using natural language processing," Npj Comput. Mater., vol. 9, no. 1, p. 52, 5 Apr. 2023.
- [7] A. Niyongabo Rubungo, C. Arnold, B. P. Rand et al., "LLM-prop: Predicting the properties of crystalline materials using large language models," Npj Comput. Mater., vol. 11, no. 1, pp. 1–13, 18 Jun. 2025.
- [8] Y. Tang, W. Xu, J. Cao et al., "MatterChat: A multi-modal LLM for material science," arXiv [cs.AI], 18 Feb. 2025, arXiv:2502.13107.
- [9] L. Hu and G. Jia, "Domain-specific large language model for predicting band gap and formation energy of III-VIIIB and III-IVA nitrides based on fine-tuned GPT-3.5-turbo," Mach. Learn. Sci. Technol., vol. 6, no. 3, p. 035019, 30 Sep. 2025.
- [10] A. Niyongabo Rubungo, K. Li, J. Hattrick-Simpers et al., "LLM4Mat-bench: Benchmarking large language models for materials property prediction," Mach. Learn. Sci. Technol., vol. 6, no. 2, p. 020501, 30 Jun. 2025.
- [11] T. Gupta, M. Zaki, N. M. A. Krishnan et al., "MatSciBERT: A materials domain language model for text mining and information extraction," Npj Comput. Mater., vol. 8, no. 1, pp. 1–11, 3 May 2022.
- [12] R. Taylor, M. Kardas, G. Cucurull et al., "Galactica: A large language model for science," arXiv [cs.CL], 16 Nov. 2022, arXiv:2211.09085.
- [13] V. Mishra, S. Singh, D. Ahlawat et al., "Foundational large language models for materials research," arXiv [cond-mat.mtrl-sci], 12 Dec. 2024, arXiv:2412.09560.
- [14] V. Korolev and P. Protsenko, "Accurate, interpretable predictions of materials properties within transformer language models," Patterns (N. Y.), vol. 4, no. 10, p. 100803, 13 Oct. 2023.
- [15] R. Jacobs, M. P. Polak, L. E. Schultz et al., "Regression with large language models for materials and molecular property prediction," arXiv [cond-mat.mtrl-sci], 9 Sep. 2024, arXiv:2409.06080.
- [16] N. Alampara, S. Miret, and K. M. Jablonka, "Less can be more for predicting properties with large language models," arXiv [cond-mat.mtrl-sci], 9 Jul. 2025, arXiv:2406.17295.
- [17] R. E. A. Goodall and A. A. Lee, "Predicting materials properties without crystal structure: Deep representation learning from stoichiometry," Nat. Commun., vol. 11, no. 1, p. 6280, 8 Dec. 2020.
- [18] A. Y.-T. Wang, S. K. Kauwe, R. J. Murdock et al., "Compositionally restricted attention-based network for materials property predictions," Npj Comput. Mater., vol. 7, no. 1, p. 77, 28 May 2021.
- [19] K. Das, P. Goyal, S.-C. Lee et al., "CrysMMNet: Multimodal representation for crystal property prediction," in Uncertainty in Artificial Intelligence. PMLR, 2 Jul. 2023, pp. 507–517.
- [20] S. Liu, T. Wen, B. Ye et al., "Large language models for material property predictions: Elastic constant tensor prediction and materials design," Digit. Discov., vol. 4, no. 6, pp. 1625–1638, 2025.
- [21] N. Gruver, A. Sriram, A. Madotto et al., "Fine-tuned language models generate stable inorganic materials as text," arXiv [cs.LG], 6 Feb. 2024, arXiv:2402.04379.
- [22] L. M. Antunes, K. T. Butler, and R. Grau-Crespo, "Crystal structure generation with autoregressive large language modeling," Nat. Commun., vol. 15, no. 1, p. 10570, 6 Dec. 2024.
- [23] D. Varivoda, R. Dong, S. S. Omee et al., "Materials property prediction with uncertainty quantification: A benchmark study," Appl. Phys. Rev., vol. 10, no. 2, p. 021409, 1 Jun. 2023.
- [24] T. G. Dietterich, "Ensemble methods in machine learning," in Multiple Classifier Systems, ser. Lecture Notes in Computer Science. Berlin, Heidelberg: Springer Berlin Heidelberg, 2000, pp. 1–15.
- [25] R. Seoh, "Qualitative analysis of Monte Carlo dropout," arXiv [stat.ML], 3 Jul. 2020, arXiv:2007.01720.
- [26] A. Amini, W. Schwarting, A. Soleimany et al., "Deep evidential regression," Advances in Neural Information Processing Systems, vol. 33, pp. 14927–14937, 2020.
- [27] A. Angelopoulos, S. Bates, J. Malik et al., "Uncertainty sets for image classifiers using conformal prediction," arXiv [cs.CV], 29 Sep. 2020, arXiv:2009.14193.
- [28] D. Varivoda, R. Dong, S. S. Omee et al., "Materials property prediction with uncertainty quantification: A benchmark study," arXiv [cond-mat.mtrl-sci], 3 Nov. 2022, arXiv:2211.02235.
- [29] Z. Ji, N. Lee, R. Frieske et al., "Survey of hallucination in natural language generation," ACM Comput. Surv., 17 Nov. 2022.
- [30] J. Geng, F. Cai, Y. Wang et al., "A survey of confidence estimation and calibration in large language models," in Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers). Stroudsburg, PA, USA: Association for Computational Linguistics, 2024, pp. 6577–6595.
- [31] R. Vashurin, E. Fadeeva, A. Vazhentsev et al., "Benchmarking uncertainty quantification methods for large language models with LM-Polygraph," Trans. Assoc. Comput. Linguist., vol. 13, pp. 220–248, 19 Mar. 2025.
- [32] J. Qu, Y. R. Xie, K. M. Ciesielski et al., "Leveraging language representation for materials exploration and discovery," Npj Comput. Mater., vol. 10, no. 1, p. 58, 21 Mar. 2024.
- [33] K. Choudhary, "AtomGPT: Atomistic generative pretrained transformer for forward and inverse materials design," J. Phys. Chem. Lett., vol. 15, no. 27, pp. 6909–6917, 11 Jul. 2024.
- [34] M. K. Horton, P. Huck, R. X. Yang et al., "Accelerated data-driven materials science with the Materials Project," Nat. Mater., vol. 24, no. 10, pp. 1522–1532, 3 Oct. 2025.
- [35] A. M. Ganose and A. Jain, "Robocrystallographer: Automated crystal structure text descriptions and analysis," MRS Commun., vol. 9, no. 3, pp. 874–881, 20 Sep. 2019.
- [36] E. J. Hu, Y. Shen, P. Wallis et al., "LoRA: Low-rank adaptation of large language models," arXiv [cs.CL], 17 Jun. 2021, arXiv:2106.09685.
- [37] T. Xie and J. C. Grossman, "Crystal graph convolutional neural networks for an accurate and interpretable prediction of material properties," arXiv [cond-mat.mtrl-sci], 27 Oct. 2017, arXiv:1710.10324.
- [38] M. Pakdaman Naeini, G. Cooper, and M. Hauskrecht, "Obtaining well calibrated probabilities using Bayesian binning," Proc. Conf. AAAI Artif. Intell., vol. 29, no. 1, 21 Feb. 2015.
- [39] C. Guo, G. Pleiss, Y. Sun et al., "On calibration of modern neural networks," arXiv [cs.LG], 14 Jun. 2017, arXiv:1706.04599.
- [40] J. Xin, R. Tang, Y. Yu et al., "The art of abstention: Selective prediction and error regularization for natural language processing," in Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Stroudsburg, PA, USA: Association for Computational Linguistics, 2021.
discussion (0)