pith. sign in

arxiv: 2606.21830 · v1 · pith:4Q5TQJSZnew · submitted 2026-06-20 · 💻 cs.LG

Mat-Pref: Verifiable-Reward Training Improves Compositional Reasoning in Inorganic Materials

Pith reviewed 2026-06-26 12:37 UTC · model grok-4.3

classification 💻 cs.LG
keywords reinforcement learningmaterials sciencecompositional reasoningionic substitutionbenchmarkpolicy optimizationlanguage model fine-tuningstructure generalization
0
0 comments X

The pith

Verifiable-reward training on a materials benchmark lets an 8B model outperform 235B models on compositional reasoning about inorganic structures.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Mat-Pref, a benchmark of ionic-substitution questions drawn from density functional theory data across eleven structure families, with splits designed to separate in-distribution performance, generalization to entirely new structure families, and cross-property transfer. Zero-shot frontier models from 70B to 671B parameters stay in the 33-54 percent range on all splits. A two-stage process of supervised fine-tuning followed by Group Relative Policy Optimization raises Qwen3-8B to 65.2 percent in-distribution and 71.6 percent on held-out families, exceeding the zero-shot 235B model by more than twenty points. The gains occur because the reinforcement step increases the chance that correct answers become the model's most likely output rather than merely reachable through sampling.

Core claim

A two-stage pipeline of supervised fine-tuning followed by Group Relative Policy Optimization lifts Qwen3-8B to 65.2 percent in-distribution and 71.6 percent on held-out families, exceeding zero-shot Qwen3-235B by over 20 percentage points on both structural-generalization splits. Self-consistency sampling shows that the SFT policy can already produce correct answers but cannot reliably surface them as the modal response; GRPO reshapes the distribution so that correct answers become modal rather than merely reachable, and this sharper commitment is visible mechanistically through logit lens analysis revealing a ~20pp advantage in answer crystallization at the critical decision layer. The pap

What carries the argument

Group Relative Policy Optimization (GRPO) applied after supervised fine-tuning, which reshapes the output distribution so correct answers become the modal response.

If this is right

  • The post-GRPO model makes correct answers the modal response rather than merely reachable through sampling.
  • Structural generalization to entirely held-out crystal structure families improves substantially.
  • Cross-property transfer, such as applying band-gap reasoning to hosts seen only through formation-energy supervision, becomes viable.
  • Logit lens analysis shows a roughly 20 percentage point advantage in answer crystallization at the critical decision layer.
  • The distractor-permutation consistency metric narrows the gap between lenient and strict scoring from 24.0 to 14.3 percentage points.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The verifiable-reward pipeline could be tested on other scientific domains that supply simulation or database ground truth, such as molecular property prediction.
  • The held-out-family split design offers a template for diagnosing memorization versus generalization in other language-model reasoning benchmarks.
  • Logit lens diagnostics might help identify which post-training methods reliably produce modal correct answers across tasks.
  • If the gains persist under stricter leakage controls, the result would favor investing in targeted post-training over further scale alone for scientific applications.

Load-bearing premise

The three evaluation splits isolate true structural generalization and cross-property transfer without the model having memorized specific compound-property pairs or benefited from leakage in question generation.

What would settle it

If the same two-stage training is repeated on a version of the benchmark in which structure-family labels are randomly scrambled and the performance advantage over zero-shot baselines disappears, the claim that the model learned compositional reasoning would be falsified.

Figures

Figures reproduced from arXiv: 2606.21830 by Jeongbin Park, Sarrah R. Mikhail Leung, Taehan Kim.

Figure 1
Figure 1. Figure 1: Mat-Pref benchmark structure. Top: 11 inorganic structure families, with 8 used for training (above the dashed line) and 3 held out entirely for OOD-host evaluation (below). Bottom: three test splits. IID: novel hosts from training families. OOD-host: entirely held-out families (garnet, halide perovskite, NASICON). OOD-property: training-family hosts evaluated on the held-out band-gap property (template gr… view at source ↗
Figure 2
Figure 2. Figure 2: MAT-PREF construction pipeline. Entries are filtered by stability, classified into structure families (anonymous formula + anion + space-group filters), assigned oxidation states, validated at the site level (ChemEnv CN rules or ionic-radius fallback), and converted into multiple-choice questions with forward/reverse goals and gap-based filtering. The 10,837 questions are stratified by template group into … view at source ↗
Figure 3
Figure 3. Figure 3: MAT-PREF at a glance. Middle: each question specifies a host, a crystallographic site, a design goal, and three or four candidate substitutions; the model selects the best candidate and justifies its reasoning. Left: a correct choice composes several chemical principles, including charge balance, ionic radius, coordination preference, and bond strength, that no single heuristic captures. Right: every candi… view at source ↗
Figure 4
Figure 4. Figure 4: GRPO training dynamics. (a) Question-level train￾ing accuracy rises from 55% to 80% (7-step moving average). Dashed lines show reference baselines: random (25.8%), Qwen2.5- 72B zero-shot (37.6%), SFT (41.3%), and Qwen3-235B zero-shot (53.3%). (b) KL divergence from the SFT reference policy stays below 0.025 nats. Response length and entropy stability are re￾ported in [PITH_FULL_IMAGE:figures/full_fig_p006… view at source ↗
Figure 5
Figure 5. Figure 5: Full GRPO training dynamics (4-panel). (a) Training accuracy. (b) Mean response length remains stable at ∼450 tokens. (c) Policy entropy decreases from 0.59 to 0.58. (d) KL divergence from SFT reference. Training uses batch size 64, group size 8, temperature 0.8, and LoRA rank 32. The total pipeline cost (trace generation + SFT + GRPO) is approximately $50. 17 [PITH_FULL_IMAGE:figures/full_fig_p017_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Logit-lens crystallization rate vs. layer. Both models transition from random (∼22%) to committed at layer 24; GRPO sustains a ∼20pp advantage thereafter. probes (L2, C=1.0) on the 4,096-dimensional post-block residual stream at the last prompt token, using a stratified 80/20 split of 3,334 test questions at each layer. Probes for target property, goal direction, and structure family reach 100% accuracy fo… view at source ↗
read the original abstract

Reinforcement learning from verifiable rewards (RLVR) has driven rapid progress in mathematical and code reasoning, but when extended to science, existing benchmarks do not decompose what generalizes: do gains reflect structural transfer, property transfer, or memorization? We introduce Mat-Pref, a benchmark of 10,837 ionic-substitution questions across 11 inorganic structure families, grounded in density functional theory calculations from the Materials Project, with three evaluation splits that isolate in-distribution performance, generalization to entirely held-out structure families, and cross-property transfer: applying band-gap reasoning to hosts seen during training only through formation-energy supervision. Four zero-shot frontier models (70-671B parameters) remain in the 33-54% range on every split, confirming that scale alone does not resolve the compositional chemical reasoning this task demands. A two-stage pipeline of supervised fine-tuning followed by Group Relative Policy Optimization (GRPO) lifts Qwen3-8B to 65.2% in-distribution and 71.6% on held-out families, exceeding zero-shot Qwen3-235B by over 20 percentage points on both structural-generalization splits. Self-consistency sampling shows that the SFT policy can already produce correct answers but cannot reliably surface them as the modal response; GRPO reshapes the distribution so that correct answers become modal rather than merely reachable, and this sharper commitment is visible mechanistically: logit lens analysis reveals a ${\sim}$20pp advantage in answer crystallization at the critical decision layer. We formalize this observation as a distractor-permutation consistency metric under which GRPO narrows the gap between lenient scoring (at least one permutation correct) and strict scoring (all permutations correct) from 24.0 to 14.3 percentage points.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript presents Mat-Pref, a benchmark of 10,837 ionic-substitution questions derived from Materials Project DFT data across 11 inorganic structure families. It defines three evaluation splits to separately measure in-distribution performance, generalization to held-out structure families, and cross-property transfer (e.g., band-gap reasoning on hosts seen only via formation-energy training). The central empirical claim is that supervised fine-tuning followed by Group Relative Policy Optimization (GRPO) on Qwen3-8B yields 65.2% accuracy in-distribution and 71.6% on held-out families, outperforming zero-shot Qwen3-235B by more than 20 points on the generalization splits. The paper further analyzes the effect of GRPO using self-consistency sampling and logit-lens techniques, introducing a distractor-permutation consistency metric that quantifies improved answer crystallization.

Significance. Should the evaluation protocol prove robust against memorization and leakage, the results would indicate that verifiable-reward RL methods can drive substantial gains in compositional scientific reasoning where model scale alone is insufficient. The mechanistic evidence from logit lens and the distractor-permutation consistency metric provide concrete, falsifiable insights into how GRPO reshapes the output distribution. The benchmark construction from verifiable DFT data is a strength that supports reproducible evaluation in the materials domain.

major comments (2)
  1. [Benchmark construction and evaluation splits section] The isolation of the held-out structure families split from training data is central to the generalization claim, yet the manuscript provides no quantitative statistics on compound overlap, ionic substitution template overlap, or potential leakage from pretraining corpora containing Materials Project entries. Without these controls, the 71.6% held-out accuracy cannot be unambiguously attributed to compositional reasoning rather than exploitation of statistical regularities in the question generation process.
  2. [Results and experimental details] The reported accuracies (65.2% in-distribution, 71.6% held-out) are presented as point estimates without statistical error bars, variance across random seeds, or explicit description of data exclusion rules and question-generation hyperparameters, which are required to establish that the gains over zero-shot baselines are reliable.
minor comments (2)
  1. [Abstract] The abstract omits the number of questions per split and any concrete description of how the cross-property transfer split is generated.
  2. [Analysis of GRPO effects] The distractor-permutation consistency metric would benefit from an explicit mathematical definition or worked example in the main text rather than relying solely on the reported 24.0-to-14.3 pp narrowing.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive feedback emphasizing the need for stronger controls on leakage and statistical reliability. We respond to each major comment below.

read point-by-point responses
  1. Referee: [Benchmark construction and evaluation splits section] The isolation of the held-out structure families split from training data is central to the generalization claim, yet the manuscript provides no quantitative statistics on compound overlap, ionic substitution template overlap, or potential leakage from pretraining corpora containing Materials Project entries. Without these controls, the 71.6% held-out accuracy cannot be unambiguously attributed to compositional reasoning rather than exploitation of statistical regularities in the question generation process.

    Authors: We agree that quantitative overlap statistics are required to support the generalization claims. In the revised manuscript we will add tables reporting compound overlap percentages and ionic substitution template overlap between training and held-out structure families. We will also include a discussion of Materials Project entry overlap with likely pretraining sources. Complete verification against proprietary pretraining corpora remains limited. revision: partial

  2. Referee: [Results and experimental details] The reported accuracies (65.2% in-distribution, 71.6% held-out) are presented as point estimates without statistical error bars, variance across random seeds, or explicit description of data exclusion rules and question-generation hyperparameters, which are required to establish that the gains over zero-shot baselines are reliable.

    Authors: We agree that variance estimates and hyperparameter details are necessary. The revised version will report accuracies with error bars over multiple random seeds, state the data exclusion rules explicitly, and provide the full set of question-generation hyperparameters. revision: yes

standing simulated objections not resolved
  • Exhaustive checks for leakage from all pretraining corpora of closed models are not feasible without access to their training data.

Circularity Check

0 steps flagged

No significant circularity; claims rest on new benchmark and empirical runs

full rationale

The paper introduces a new benchmark (Mat-Pref) with explicitly constructed splits from Materials Project data and reports empirical accuracies from SFT+GRPO training on Qwen3-8B versus baselines. No derivation chain reduces a claimed result to a fitted parameter, self-citation, or ansatz by construction; the distractor-permutation consistency metric is presented as a post-hoc formalization of observed logit-lens behavior rather than a load-bearing prediction. The central performance numbers (65.2%, 71.6%) are direct experimental outputs, not quantities defined in terms of themselves.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the unverified assumption that the ionic-substitution questions and DFT labels from Materials Project cleanly test compositional reasoning without data leakage or memorization artifacts; no free parameters or invented entities are introduced in the abstract.

axioms (1)
  • domain assumption The benchmark questions and splits isolate structural generalization and cross-property transfer without confounding memorization or leakage.
    Invoked by the claim that the three splits measure distinct forms of generalization.

pith-pipeline@v0.9.1-grok · 5860 in / 1259 out tokens · 24660 ms · 2026-06-26T12:37:40.495386+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

39 extracted references · 11 canonical work pages · 9 internal anchors

  1. [1]

    , title =

    Jain, Anubhav and Ong, Shyue Ping and Hautier, Geoffroy and Chen, Wei and Richards, William Davidson and Dacek, Stephen and Cholia, Shreyas and Gunter, Dan and Skinner, David and Ceder, Gerbrand and Persson, Kristin A. , title =. APL Materials , volume =

  2. [2]

    and Rignanese, Gian-Marco and Gonze, Xavier and Hautier, Geoffroy , title =

    Waroquiers, David and George, Julie and Horton, Matthew and Schenk, Stephan and Persson, Kristin A. and Rignanese, Gian-Marco and Gonze, Xavier and Hautier, Geoffroy , title =. Acta Crystallographica Section B , volume =

  3. [4]

    and Li, Y

    Guo, Daya and Zhu, Qihao and Yang, Dejian and Xie, Zhenda and Dong, Kai and Zhang, Wentao and Chen, Guanting and Bi, Xiao and Wu, Y. and Li, Y. K. and Luo, Fuli and Xiong, Yingfei and Liang, Wenfeng , journal=

  4. [5]

    and Persson, Kristin A

    Ong, Shyue Ping and Richards, William Davidson and Jain, Anubhav and Hautier, Geoffroy and Kocher, Michael and Cholia, Shreyas and Gunter, Dan and Chevrier, Vincent L. and Persson, Kristin A. and Ceder, Gerbrand , title =. Computational Materials Science , volume =

  5. [6]

    and Ong, Shyue Ping and Hautier, Geoffroy and Jain, Anubhav and Richards, William Davidson and Gamst, Alex C

    Sun, Wenhao and Dacek, Stephen T. and Ong, Shyue Ping and Hautier, Geoffroy and Jain, Anubhav and Richards, William Davidson and Gamst, Alex C. and Persson, Kristin A. and Ceder, Gerbrand , title =. Science Advances , volume =

  6. [7]

    Humanity's Last Exam

    Humanity's last exam , author=. arXiv preprint arXiv:2501.14249 , year=

  7. [8]

    2025 , month = jul, howpublished =

    Skarlinski, Michael and Laurent, Jon and Bou, Albert and White, Andrew , title =. 2025 , month = jul, howpublished =

  8. [9]

    Humanity's Last Exam (HLE) Bio/Chem Gold , year =

  9. [10]

    Benchmarking materials property prediction methods: the

    Dunn, Alexander and Wang, Qi and Ganose, Alex and Dopp, Daniel and Jain, Anubhav , journal=. Benchmarking materials property prediction methods: the. 2020 , publisher=

  10. [11]

    2023 , publisher=

    Song, Yu and Miret, Santiago and Liu, Bang , booktitle=. 2023 , publisher=

  11. [12]

    Advances in Neural Information Processing Systems , volume=

    Training a Scientific Reasoning Model for Chemistry , author=. Advances in Neural Information Processing Systems , volume=

  12. [13]

    and Torkar, Michaela and Li, Donghui and Karaletsos, Theofanis , booktitle=

    Istrate, Ana-Maria and Milletari, Fausto and Castrotorres, Fabrizio and Tomczak, Jakub M. and Torkar, Michaela and Li, Donghui and Karaletsos, Theofanis , booktitle=. rbio1 -- Training Scientific Reasoning

  13. [14]

    2025 , month = jul, howpublished =

    About 30\. 2025 , month = jul, howpublished =

  14. [15]

    NeurIPS 2025 Workshop on AI for Science , year=

    Towards Generating Stable Materials via Large Language Models with Reinforcement Learning Finetuning , author=. NeurIPS 2025 Workshop on AI for Science , year=

  15. [17]

    and Shen, Yelong and Wallis, Phillip and Allen-Zhu, Zeyuan and Li, Yuanzhi and Wang, Shean and Wang, Lu and Chen, Weizhu , booktitle=

    Hu, Edward J. and Shen, Yelong and Wallis, Phillip and Allen-Zhu, Zeyuan and Li, Yuanzhi and Wang, Shean and Wang, Lu and Chen, Weizhu , booktitle=. 2022 , url=

  16. [20]

    Grattafiori, Aaron and Dubey, Abhimanyu and Jauhri, Abhinav and Pandey, Abhinav and Kadian, Abhishek and Al-Dahle, Ahmad and Letman, Aiesha and Mathur, Akhil and Schelten, Alan and Vaughan, Alex and others , journal=. The

  17. [21]

    DeepSeek-AI , journal=

  18. [22]

    Advances in Neural Information Processing Systems , year=

    Eliciting Latent Predictions from Transformers with the Tuned Lens , author=. Advances in Neural Information Processing Systems , year=

  19. [23]

    Eliciting latent predictions from transformers with the tuned lens

    Nora Belrose, Zach Furman, Logan Smith, Danny Halawi, Igor Ostrovsky, Lev McKinney, Stella Biderman, and Jacob Steinhardt. Eliciting latent predictions from transformers with the tuned lens. In Advances in Neural Information Processing Systems, 2023

  20. [24]

    DeepSeek-V3 Technical Report

    DeepSeek-AI. DeepSeek-V3 technical report. arXiv preprint arXiv:2412.19437, 2024

  21. [25]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    DeepSeek-AI. DeepSeek-R1 : Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025

  22. [26]

    Benchmarking materials property prediction methods: the matbench test set and automatminer reference algorithm

    Alexander Dunn, Qi Wang, Alex Ganose, Daniel Dopp, and Anubhav Jain. Benchmarking materials property prediction methods: the Matbench test set and Automatminer reference algorithm. npj Computational Materials, 6 0 (1): 0 138, 2020. doi:10.1038/s41524-020-00406-3

  23. [27]

    Humanity's last exam (hle) bio/chem gold

    FutureHouse . Humanity's last exam (hle) bio/chem gold. https://huggingface.co/datasets/futurehouse/hle-gold-bio-chem, 2025 a . Hugging Face dataset card, accessed April 18, 2026

  24. [28]

    About 30\ are likely wrong

    FutureHouse . About 30\ are likely wrong. https://www.futurehouse.org/research-announcements/hle-exam, July 2025 b . Blog post, accessed April 19, 2026

  25. [29]

    The Llama 3 Herd of Models

    Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024

  26. [30]

    Daya Guo, Qihao Zhu, Dejian Yang, Zhenda Xie, Kai Dong, Wentao Zhang, Guanting Chen, Xiao Bi, Y. Wu, Y. K. Li, Fuli Luo, Yingfei Xiong, and Wenfeng Liang. DeepSeek-Coder : When the large language model meets programming -- the rise of code intelligence. arXiv preprint arXiv:2401.14196, 2024

  27. [31]

    Towards generating stable materials via large language models with reinforcement learning finetuning

    Zhang-Wei Hong, Nofit Segal, Aviv Netanyahu, Hoje Chun, Rafael Gomez-Bombarelli, and Pulkit Agrawal. Towards generating stable materials via large language models with reinforcement learning finetuning. In NeurIPS 2025 Workshop on AI for Science, 2025

  28. [32]

    Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen

    Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA : Low-rank adaptation of large language models. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=nZeVKeeFYf9

  29. [33]

    Tomczak, Michaela Torkar, Donghui Li, and Theofanis Karaletsos

    Ana-Maria Istrate, Fausto Milletari, Fabrizio Castrotorres, Jakub M. Tomczak, Michaela Torkar, Donghui Li, and Theofanis Karaletsos. rbio1 -- training scientific reasoning LLMs with biological world models as soft verifiers. In NeurIPS 2025 Workshop on AI Virtual Cells and Instruments: A New Era in Drug Discovery and Development, 2025

  30. [34]

    Anubhav Jain, Shyue Ping Ong, Geoffroy Hautier, Wei Chen, William Davidson Richards, Stephen Dacek, Shreyas Cholia, Dan Gunter, David Skinner, Gerbrand Ceder, and Kristin A. Persson. Commentary: The materials project: A materials genome approach to accelerating materials innovation. APL Materials, 1 0 (1): 0 011002, 2013

  31. [35]

    AtomWorld: A Benchmark for Evaluating Spatial Reasoning in Large Language Models on Crystalline Materials

    Taoyuze Lv, Alexander Chen, Fengyu Xie, Chu Wu, Jeffrey Meng, Dongzhan Zhou, Bram Hoex, Zhicheng Zhong, and Tong Xie. Atomworld: A benchmark for evaluating spatial reasoning in large language models on crystalline materials. arXiv preprint arXiv:2510.04704, 2025

  32. [36]

    Narayanan, James D

    Siddharth M. Narayanan, James D. Braza, Ryan-Rhys Griffiths, Albert Bou, Geemi P. Wellawatte, Mayk Caldas Ramos, Ludovico Mitchener, Samuel G. Rodriques, and Andrew D. White. Training a scientific reasoning model for chemistry. In Advances in Neural Information Processing Systems, volume 38, 2025

  33. [37]

    Chevrier, Kristin A

    Shyue Ping Ong, William Davidson Richards, Anubhav Jain, Geoffroy Hautier, Michael Kocher, Shreyas Cholia, Dan Gunter, Vincent L. Chevrier, Kristin A. Persson, and Gerbrand Ceder. Python materials genomics (pymatgen): A robust, open-source python library for materials analysis. Computational Materials Science, 68: 0 314--319, 2013

  34. [38]

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. DeepSeekMath : Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024

  35. [39]

    MatSci-NLP : Evaluating scientific language models on materials science language tasks using text-to-schema modeling

    Yu Song, Santiago Miret, and Bang Liu. MatSci-NLP : Evaluating scientific language models on materials science language tasks using text-to-schema modeling. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3621--3639, Toronto, Canada, 2023. Association for Computational Linguistics. ...

  36. [40]

    Dacek, Shyue Ping Ong, Geoffroy Hautier, Anubhav Jain, William Davidson Richards, Alex C

    Wenhao Sun, Stephen T. Dacek, Shyue Ping Ong, Geoffroy Hautier, Anubhav Jain, William Davidson Richards, Alex C. Gamst, Kristin A. Persson, and Gerbrand Ceder. The thermodynamic scale of inorganic crystalline metastability. Science Advances, 2 0 (11): 0 e1600225, 2016

  37. [41]

    Persson, Gian-Marco Rignanese, Xavier Gonze, and Geoffroy Hautier

    David Waroquiers, Julie George, Matthew Horton, Stephan Schenk, Kristin A. Persson, Gian-Marco Rignanese, Xavier Gonze, and Geoffroy Hautier. Chemenv: a fast and robust coordination environment identification tool. Acta Crystallographica Section B, 76 0 (4): 0 683--695, 2020

  38. [42]

    Qwen3 Technical Report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025 a

  39. [43]

    Qwen2.5 Technical Report

    An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, et al. Qwen2.5 technical report. arXiv preprint arXiv:2412.15115, 2025 b