pith. machine review for the scientific record.

arxiv: 2604.03361 · v1 · submitted 2026-04-03 · 💻 cs.LG · q-bio.QM

Recognition: no theorem link

The limits of bio-molecular modeling with large language models: a cross-scale evaluation

Fengwei An, Tianyu Zhao, Yaxin Xu, Yue Zhou, Zhixiang Ren

Pith reviewed 2026-05-13 20:06 UTC · model grok-4.3

classification 💻 cs.LG q-bio.QM
keywords bio-molecular modeling · large language models · benchmark evaluation · chain-of-thought prompting · hybrid mamba-attention · supervised fine-tuning · classification versus regression

The pith

A 26-task benchmark reveals large language models remain weak on bio-molecular regression despite strengths in classification.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper builds BioMol-LLM-Bench as a unified evaluation framework with 26 tasks spanning four difficulty levels and integrated computational tools to test LLMs on bio-molecular problems across scales. Testing thirteen representative models produces four concrete findings about their behavior. Chain-of-thought prompting delivers little gain and can lower accuracy on biological tasks. Hybrid architectures that combine mamba and attention layers handle long sequences more effectively than pure transformers. Supervised fine-tuning sharpens performance on narrow tasks while eroding broader generalization. Models classify bio-molecular properties reliably but falter on demanding regression predictions that require quantitative mechanistic insight.
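The classification-versus-regression split in that last finding can be illustrated with a minimal scoring harness. This is a sketch, not the BioMol-LLM-Bench implementation; the task types, predictions, and numbers are invented for illustration.

```python
# Minimal sketch of a cross-scale benchmark scorer: exact-match accuracy for
# classification tasks, RMSE for regression tasks. All data below is invented.
from statistics import mean

def accuracy(preds, labels):
    # Fraction of exact matches (classification).
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

def rmse(preds, targets):
    # Root-mean-square error (regression).
    return mean((p - t) ** 2 for p, t in zip(preds, targets)) ** 0.5

def score_task(task_type, preds, gold):
    return accuracy(preds, gold) if task_type == "classification" else rmse(preds, gold)

# Toy results echoing the reported pattern: classification looks solid while
# quantitative regression (here, hypothetical kcal/mol values) drifts badly.
cls = score_task("classification", ["toxic", "safe", "toxic"], ["toxic", "safe", "toxic"])
reg = score_task("regression", [-6.1, -7.9], [-8.3, -5.2])
```

A per-task metric table built this way is all that is needed to reproduce the paper's fourth finding at a glance: high accuracy on the classification rows, large errors on the regression rows.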

Core claim

BioMol-LLM-Bench evaluation of thirteen LLMs demonstrates systematic gaps between model outputs and mechanistic understanding of multi-scale bio-molecular systems, shown through limited or negative effects of chain-of-thought data, advantages of hybrid mamba-attention architectures on long sequences, specialization-generalization trade-offs after supervised fine-tuning, and reliable classification paired with persistent weakness on regression tasks.

What carries the argument

BioMol-LLM-Bench, the proposed cross-scale benchmark framework consisting of 26 downstream tasks at four difficulty levels with tool augmentation.

If this is right

  • Chain-of-thought data should be used sparingly or omitted for biological tasks to avoid performance losses.
  • Hybrid mamba-attention models merit priority when processing extended bio-molecular sequences.
  • Supervised fine-tuning requires safeguards to retain generalization across molecular scales.
  • Current LLMs suit classification work on bio-molecular properties but require further advances for accurate regression.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Architectures that embed explicit physical constraints could close the regression gap left by pure language modeling.
  • Expanding the benchmark with direct molecular-dynamics trajectories would test whether the observed limits hold under more mechanistic conditions.
  • Training mixtures that interleave experimental measurements with simulation data might reduce the specialization-generalization trade-off.

Load-bearing premise

The twenty-six chosen tasks sufficiently represent the mechanistic challenges of real multi-scale bio-molecular modeling.

What would settle it

An LLM that matches or exceeds baseline accuracy on held-out regression tasks such as quantitative prediction of binding free energies or reaction rates within the same benchmark setup would directly challenge the reported weakness.
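One way such a head-to-head could be scored is Spearman rank correlation between predicted and experimental binding free energies, for the candidate LLM and a baseline on the same held-out set. The helper below is a minimal stdlib sketch (no tie handling) and every number in it is invented.

```python
# Spearman rank correlation between predictions and experiment, used to
# compare a hypothetical LLM against a hypothetical baseline. Invented data.
def spearman(xs, ys):
    def ranks(vals):
        # Rank positions for distinct values (no tie correction).
        order = sorted(range(len(vals)), key=lambda i: vals[i])
        r = [0.0] * len(vals)
        for rank, i in enumerate(order):
            r[i] = float(rank)
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx) ** 0.5
    vy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (vx * vy)

experimental = [-9.1, -7.4, -8.2, -6.0, -10.3]   # kcal/mol, invented
llm_pred     = [-8.0, -7.9, -8.5, -6.2, -9.0]
baseline     = [-9.0, -6.9, -8.0, -6.5, -10.1]

llm_rho = spearman(llm_pred, experimental)
base_rho = spearman(baseline, experimental)
```

In this toy setup the baseline still ranks the compounds better than the LLM; an LLM whose `llm_rho` matched or exceeded `base_rho` on genuinely held-out tasks is the result that would challenge the paper's finding.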

read the original abstract

The modeling of bio-molecular system across molecular scales remains a central challenge in scientific research. Large language models (LLMs) are increasingly applied to bio-molecular discovery, yet systematic evaluation across multi-scale biological problems and rigorous assessment of their tool-augmented capabilities remain limited. We reveal a systematic gap between LLM performance and mechanistic understanding through the proposed cross-scale bio-molecular benchmark: BioMol-LLM-Bench, a unified framework comprising 26 downstream tasks that covers 4 distinct difficulty levels, and computational tools are integrated for a more comprehensive evaluation. Evaluation on 13 representative models reveals 4 main findings: chain-of-thought data provides limited benefit and may even reduce performance on biological tasks; hybrid mamba-attention architectures are more effective for long bio-molecular sequences; supervised fine-tuning improves specialization at the cost of generalization; and current LLMs perform well on classification tasks but remain weak on challenging regression tasks. Together, these findings provide practical guidance for future LLM-based modeling of molecular systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces BioMol-LLM-Bench, a unified benchmark with 26 downstream tasks spanning 4 difficulty levels for cross-scale bio-molecular modeling. It evaluates 13 representative LLMs and reports four findings: chain-of-thought data offers limited or negative benefit on biological tasks; hybrid mamba-attention architectures outperform on long sequences; supervised fine-tuning boosts specialization at the expense of generalization; and LLMs excel at classification but struggle with challenging regression tasks. The authors conclude this reveals a systematic gap between LLM performance and mechanistic understanding.

Significance. If the benchmark tasks genuinely probe multi-scale biophysical mechanisms rather than surface statistics, the empirical results across diverse models would provide actionable guidance for LLM architectures and training strategies in molecular biology and drug discovery. The broad model coverage is a positive aspect of the evaluation.

major comments (2)
  1. [Benchmark construction] The central claim of a 'systematic gap between LLM performance and mechanistic understanding' depends on the 26 tasks in BioMol-LLM-Bench requiring capture of physical cross-scale phenomena. The paper groups tasks into four difficulty levels but provides no explicit mapping demonstrating that higher levels enforce biophysical constraints such as energy conservation, force-field consistency, or long-range allostery (see abstract and benchmark construction description).
  2. [Abstract] The description of the benchmark and findings omits details on task selection criteria, the statistical tests used to support the four conclusions, and error bars on reported performance metrics, which limits assessment of the robustness of the observed gaps.
minor comments (1)
  1. [Results] The four findings are listed clearly in the abstract but would be strengthened by explicit quantitative comparisons (e.g., performance deltas) in the main text.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. The feedback highlights important opportunities to strengthen the connection between benchmark tasks and biophysical principles as well as to improve clarity in the abstract. We address each major comment below and will incorporate revisions accordingly.

read point-by-point responses
  1. Referee: [Benchmark construction] The central claim of a 'systematic gap between LLM performance and mechanistic understanding' depends on the 26 tasks in BioMol-LLM-Bench requiring capture of physical cross-scale phenomena. The paper groups tasks into four difficulty levels but provides no explicit mapping demonstrating that higher levels enforce biophysical constraints such as energy conservation, force-field consistency, or long-range allostery (see abstract and benchmark construction description).

    Authors: We appreciate this observation. The difficulty levels were designed to progressively incorporate tasks that demand modeling of cross-scale interactions (e.g., level 3–4 tasks include multi-domain proteins and allosteric effects), which in practice require capturing biophysical consistency beyond surface statistics. However, we acknowledge that an explicit mapping table linking each level to specific constraints such as energy conservation or force-field consistency was not included. We will add a dedicated subsection (and accompanying table) in the revised benchmark construction section that explicitly maps task levels to the biophysical principles they probe, with concrete examples drawn from the 26 tasks. This will directly support the central claim. revision: yes

  2. Referee: [Abstract] The description of the benchmark and findings omits details on task selection criteria, the statistical tests used to support the four conclusions, and error bars on reported performance metrics, which limits assessment of the robustness of the observed gaps.

    Authors: We agree that the abstract would benefit from greater specificity. In the revised manuscript we will expand the abstract to briefly note: (i) task selection criteria (coverage across molecular scales from sequence to structure-function with four graded difficulty levels), (ii) the statistical tests employed (paired t-tests and Wilcoxon rank-sum tests for model comparisons, with p-values reported in the main text), and (iii) that all performance metrics include error bars (standard deviation across three random seeds, shown in Figures 2–5). These details are already present in the methods and results sections; the abstract revision will make them visible at a glance without exceeding length limits. revision: yes
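The paired comparison the rebuttal describes can be sketched with the standard library alone: the paired t statistic over per-task score differences between two models, plus the seed-to-seed standard deviation used for error bars. All scores and seed means below are invented, not taken from the paper.

```python
# Sketch of the statistics named in the rebuttal (invented numbers):
# a paired t statistic over per-task differences, and the std across
# three random seeds that would back the error bars.
import math
import statistics

# Per-task scores for two models on the same 8 tasks (paired samples).
model_a = [0.81, 0.74, 0.69, 0.88, 0.77, 0.71, 0.83, 0.79]
model_b = [0.76, 0.70, 0.66, 0.84, 0.75, 0.65, 0.80, 0.74]

diffs = [a - b for a, b in zip(model_a, model_b)]
n = len(diffs)
# Paired t statistic: mean difference divided by its standard error.
t_stat = statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(n))

# Error bars: standard deviation of one model's mean score across 3 seeds.
seed_means = [0.781, 0.774, 0.790]
err = statistics.stdev(seed_means)
```

The t statistic still needs a p-value from the t distribution with n−1 degrees of freedom (e.g. via `scipy.stats.ttest_rel`, which also handles the Wilcoxon-style alternatives the rebuttal mentions); the sketch only shows where the numbers come from.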

Circularity Check

0 steps flagged

No circularity: purely empirical benchmark evaluation

full rationale

The paper introduces BioMol-LLM-Bench as a new collection of 26 tasks across difficulty levels and reports performance of 13 external LLMs on them. All four main findings are direct observations from these runs (e.g., CoT benefit, architecture comparisons, SFT effects, classification vs. regression gaps). No equations, fitted parameters, or predictions are defined in terms of the target results; the benchmark tasks and metrics are external to any model output. Self-citations, if present, are not load-bearing for any derivation. The evaluation is therefore self-contained against external models and tasks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims rest on the assumption that the chosen tasks represent key bio-molecular challenges; no free parameters or invented entities are introduced.

axioms (1)
  • domain assumption The 26 tasks in BioMol-LLM-Bench adequately represent multi-scale bio-molecular modeling problems
    Benchmark construction and all performance claims depend on this premise.

pith-pipeline@v0.9.0 · 5481 in / 1134 out tokens · 43849 ms · 2026-05-13T20:06:15.140863+00:00 · methodology


Reference graph

Works this paper leans on

72 extracted references · 72 canonical work pages · 8 internal anchors
