Large language model-enabled automated data extraction for concrete materials informatics
Pith reviewed 2026-05-08 11:13 UTC · model grok-4.3
The pith
An LLM pipeline extracts nearly 9,000 high-quality concrete material records from over 27,000 papers.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors present a generalizable LLM-powered pipeline that extracts and structures materials data from unstructured literature, using concrete as a test case. The pipeline performs robustly across many LLMs, reaching an F1 score of up to 0.97 on composition-process-property attributes. Within one hour it produces nearly 9,000 high-quality records with more than 100 attributes from over 27,000 publications, forming the largest open laboratory database for blended cement concrete. Machine-learning tests confirm that larger, more diverse extracted datasets improve both in-distribution accuracy and out-of-distribution generalization.
What carries the argument
The LLM-powered pipeline that reads papers and outputs structured records on composition, process, and property attributes.
If this is right
- Materials researchers can now build large, open experimental datasets in hours instead of years of manual curation.
- Machine-learning models trained on these datasets show improved accuracy on both known and previously unseen concrete formulations.
- The same pipeline can be reused in other materials domains without major redesign.
- Scalable literature-to-data conversion becomes a practical route to the data infrastructures needed for materials informatics.
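The distinction between known and previously unseen formulations can be made concrete with a minimal sketch. It contrasts a random split (in-distribution) with a split that holds out an entire material group (out-of-distribution). The record fields (`scm`, `w_c`, `strength_mpa`), the values, and both helper functions are hypothetical illustrations, not the paper's actual schema or evaluation protocol.

```python
import random

def random_split(records, test_fraction, seed=0):
    """In-distribution split: test records come from the same groups as training."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def group_holdout_split(records, group_key, held_out):
    """Out-of-distribution split: every record of a held-out group goes to test."""
    train = [r for r in records if r[group_key] not in held_out]
    test = [r for r in records if r[group_key] in held_out]
    return train, test

# Invented toy records; the values carry no experimental meaning.
records = [
    {"scm": "fly_ash", "w_c": 0.45, "strength_mpa": 42.0},
    {"scm": "fly_ash", "w_c": 0.50, "strength_mpa": 37.5},
    {"scm": "slag", "w_c": 0.40, "strength_mpa": 51.0},
    {"scm": "slag", "w_c": 0.55, "strength_mpa": 33.0},
    {"scm": "calcined_clay", "w_c": 0.45, "strength_mpa": 44.0},
    {"scm": "calcined_clay", "w_c": 0.50, "strength_mpa": 39.0},
]

id_train, id_test = random_split(records, test_fraction=0.33)
ood_train, ood_test = group_holdout_split(records, "scm", {"calcined_clay"})
```

A model scored on `ood_test` never saw the held-out material during training, which is the stricter of the two tests.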
Where Pith is reading between the lines
- The method could be applied to other text-heavy scientific fields where experimental results sit in journal articles rather than databases.
- Downstream users may still need targeted human checks on the most critical attributes before using the data for safety-critical predictions.
- Combining this extraction step with active learning loops could let models request additional literature on the materials they predict least accurately.
Load-bearing premise
That the records extracted by the language model are free of systematic errors, omissions, or biases that would mislead later machine-learning analyses.
What would settle it
A side-by-side comparison of the pipeline's output against a human-curated gold-standard set of several hundred papers, reporting exact agreement rates per attribute and any consistent patterns of omission.
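A comparison of this kind could be scored roughly as in the sketch below, which computes per-attribute precision, recall, F1 (using F1 = 2PR / (P + R)), and an omission rate. It rests on assumptions that are not from the paper: records are plain dictionaries keyed by a hypothetical paper ID, values are compared by exact equality, and a wrong value counts as both a false positive and a false negative. A real evaluation would also need unit normalization and numeric tolerances.

```python
from collections import defaultdict

def per_attribute_scores(gold, pred):
    """Score extracted records against a gold standard, attribute by attribute.

    gold, pred: dict of paper_id -> dict of attribute -> value.
    A missing key or a None value means "not reported".
    """
    counts = defaultdict(
        lambda: {"tp": 0, "fp": 0, "fn": 0, "omitted": 0, "gold": 0}
    )
    for paper_id, gold_rec in gold.items():
        pred_rec = pred.get(paper_id, {})
        for attr in set(gold_rec) | set(pred_rec):
            g, p = gold_rec.get(attr), pred_rec.get(attr)
            c = counts[attr]
            if g is not None:
                c["gold"] += 1
            if g is not None and p is not None:
                if g == p:
                    c["tp"] += 1
                else:
                    # Assumed convention: a wrong value is both a spurious
                    # extraction and a missed gold value.
                    c["fp"] += 1
                    c["fn"] += 1
            elif p is not None:
                c["fp"] += 1
            elif g is not None:
                c["fn"] += 1
                c["omitted"] += 1

    scores = {}
    for attr, c in counts.items():
        extracted = c["tp"] + c["fp"]
        expected = c["tp"] + c["fn"]
        precision = c["tp"] / extracted if extracted else 0.0
        recall = c["tp"] / expected if expected else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        omission_rate = c["omitted"] / c["gold"] if c["gold"] else 0.0
        scores[attr] = {
            "precision": precision,
            "recall": recall,
            "f1": f1,
            "omission_rate": omission_rate,
        }
    return scores

# Invented two-paper example with hypothetical attribute names.
gold = {
    "p1": {"w_c": 0.45, "fly_ash_pct": 20},
    "p2": {"w_c": 0.50, "fly_ash_pct": 30},
}
pred = {
    "p1": {"w_c": 0.45, "fly_ash_pct": 20},
    "p2": {"w_c": 0.55},  # wrong w/c ratio, fly-ash value omitted
}
scores = per_attribute_scores(gold, pred)
```

Reporting the omission rate separately from F1 exposes consistent patterns of missing values that a single aggregate score would hide.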
Original abstract
The promise of data-driven materials discovery remains constrained by the scarcity of large, high-quality, and accessible experimental datasets. Here, we introduce a generalizable large language model (LLM)-powered pipeline for automated extraction and structuring of materials data from unstructured scientific literature, using concrete materials as a representative and particularly challenging example. The pipeline exhibits robust performance across a broad range of LLMs and achieves an $F_1$ score of up to 0.97 for diverse composition--process--property attributes. Within one hour, it extracts nearly 9,000 high-quality records with over 100 attributes screened from more than 27,000 publications, enabling the construction of the largest open laboratory database for blended cement concrete. Machine learning analyses underscore the importance of large, diverse, and information-rich datasets for enhancing both in-distribution accuracy and out-of-distribution generalization to unseen materials. The proposed pipeline is readily adaptable to other materials domains and accelerates the development of scalable data infrastructures for materials informatics.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces an LLM-powered pipeline for automated extraction and structuring of composition-process-property data from unstructured concrete materials literature. It reports robust performance across multiple LLMs with F1 scores up to 0.97, extraction of nearly 9,000 high-quality records (over 100 attributes) from more than 27,000 publications within one hour, construction of the largest open laboratory database for blended cement concrete, and downstream ML experiments demonstrating improved in-distribution accuracy and out-of-distribution generalization from larger, more diverse datasets. The pipeline is presented as generalizable to other materials domains.
Significance. If the reported extraction quality holds under rigorous validation, the work would provide a scalable, domain-adaptable tool that directly addresses data scarcity in materials informatics. The scale of the extracted dataset and the explicit demonstration that larger/diverse data improves ML generalization are concrete strengths; the open release of the resulting database would further amplify impact. The approach could accelerate similar efforts in other subfields where literature is abundant but structured data is sparse.
major comments (3)
- [§3 and §4] §3 (Methods) and §4 (Results): The F1 score of up to 0.97 is presented as the central performance metric, yet the manuscript provides no numerical size for the human validation set, no inter-annotator agreement statistic, and no breakdown of error types (e.g., omission of ambiguous w/c ratios or fly-ash replacement clauses). Without these quantities, it is impossible to assess whether the headline performance claim is robust against the known ambiguities in concrete literature.
- [§4.2] §4.2 (Post-processing and filtering): The criteria used to select the final ~9,000 “high-quality” records from the raw LLM outputs are not specified (e.g., confidence thresholds, attribute completeness rules, or manual review fraction). This choice directly affects the claim that the extracted database is suitable for downstream ML analyses and must be documented with quantitative justification.
- [§5] §5 (ML analyses): The out-of-distribution generalization experiments rely on the extracted records being unbiased; however, no sensitivity analysis is shown that quantifies how plausible systematic extraction errors (e.g., under-reporting of low-strength mixes) would propagate into the reported accuracy gains.
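The inter-annotator agreement statistic requested in the first major comment is commonly reported as Cohen's kappa. The sketch below is a plain implementation for two annotators assigning categorical labels. The label set (how a fly-ash replacement level is expressed) is an invented example, and nothing here reflects which statistic the authors actually used.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1.0 - expected)

# Invented annotations for six passages.
annotator_1 = ["by_mass", "by_mass", "by_volume", "unclear", "by_volume", "by_mass"]
annotator_2 = ["by_mass", "by_volume", "by_volume", "unclear", "by_volume", "by_mass"]
kappa = cohens_kappa(annotator_1, annotator_2)
```

Kappa corrects raw agreement for the agreement expected by chance, which matters when one label dominates the validation set.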
minor comments (2)
- [Figure 2, Table 1] Figure 2 and Table 1: axis labels and legend entries use inconsistent abbreviations for attributes (e.g., “w/c” vs. “water-cement ratio”); standardize notation for readability.
- [Abstract and §4] The abstract states “within one hour” but the main text does not report wall-clock time or hardware details for the 27k-paper run; add this information to support the scalability claim.
Simulated Author's Rebuttal
We are grateful to the referee for their detailed and constructive feedback on our manuscript. We have carefully considered each major comment and provide point-by-point responses below, indicating where revisions have been made to address the concerns.
- Referee: [§3 and §4] §3 (Methods) and §4 (Results): The F1 score of up to 0.97 is presented as the central performance metric, yet the manuscript provides no numerical size for the human validation set, no inter-annotator agreement statistic, and no breakdown of error types (e.g., omission of ambiguous w/c ratios or fly-ash replacement clauses). Without these quantities, it is impossible to assess whether the headline performance claim is robust against the known ambiguities in concrete literature.
Authors: We agree that these details are necessary for a complete assessment of our validation results. Although the validation process was described at a high level, the specific quantities were not reported. In the revised manuscript, we have added the size of the human validation set, the inter-annotator agreement statistic, and a breakdown of error types to §3 and §4. This includes discussion of how errors related to ambiguous clauses in the literature were handled. revision: yes
- Referee: [§4.2] §4.2 (Post-processing and filtering): The criteria used to select the final ~9,000 “high-quality” records from the raw LLM outputs are not specified (e.g., confidence thresholds, attribute completeness rules, or manual review fraction). This choice directly affects the claim that the extracted database is suitable for downstream ML analyses and must be documented with quantitative justification.
Authors: We thank the referee for pointing this out. The selection criteria were applied but not fully detailed in the original submission. We have revised §4.2 to explicitly state the post-processing and filtering criteria, including any confidence thresholds, completeness requirements, and the extent of manual review, supported by quantitative metrics on how these choices impacted the final dataset. revision: yes
- Referee: [§5] §5 (ML analyses): The out-of-distribution generalization experiments rely on the extracted records being unbiased; however, no sensitivity analysis is shown that quantifies how plausible systematic extraction errors (e.g., under-reporting of low-strength mixes) would propagate into the reported accuracy gains.
Authors: This is a valid concern regarding the robustness of our ML findings. While we believe the large scale of the dataset mitigates some biases, we have added a sensitivity analysis to §5 in the revised manuscript. This analysis simulates the effects of potential systematic errors in the extracted data and confirms that the improvements in out-of-distribution generalization remain significant. revision: yes
Circularity Check
No circularity detected; performance claims rest on external validation
full rationale
The paper describes an empirical LLM pipeline for literature data extraction and reports measured F1 scores and record counts. No equations, derivations, or self-referential definitions appear in the abstract or summary. Performance is stated as evaluated against ground truth rather than being forced by internal fits or self-citations. The central claims therefore remain independent of the inputs by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Large language models can accurately extract structured composition-process-property data from unstructured concrete literature across diverse paper styles.