pith. machine review for the scientific record.

arxiv: 2604.03361 · v1 · submitted 2026-04-03 · 💻 cs.LG · q-bio.QM

Recognition: no theorem link

The limits of bio-molecular modeling with large language models: a cross-scale evaluation

Fengwei An, Tianyu Zhao, Yaxin Xu, Yue Zhou, Zhixiang Ren

Pith reviewed 2026-05-13 20:06 UTC · model grok-4.3

classification 💻 cs.LG q-bio.QM
keywords bio-molecular modeling · large language models · benchmark evaluation · chain-of-thought prompting · hybrid mamba-attention · supervised fine-tuning · classification versus regression

The pith

A 26-task benchmark reveals large language models remain weak on bio-molecular regression despite strengths in classification.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper builds BioMol-LLM-Bench as a unified evaluation framework with 26 tasks spanning four difficulty levels and integrated computational tools to test LLMs on bio-molecular problems across scales. Testing thirteen representative models produces four concrete findings about their behavior. Chain-of-thought prompting delivers little gain and can lower accuracy on biological tasks. Hybrid architectures that combine mamba and attention layers handle long sequences more effectively than pure transformers. Supervised fine-tuning sharpens performance on narrow tasks while eroding broader generalization. Models classify bio-molecular properties reliably but falter on demanding regression predictions that require quantitative mechanistic insight.
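The classification-versus-regression split in that last finding can be illustrated with a minimal scoring harness. This is a sketch, not the BioMol-LLM-Bench implementation; the task types, predictions, and numbers are invented for illustration.

```python
# Minimal sketch of a cross-scale benchmark scorer: exact-match accuracy for
# classification tasks, RMSE for regression tasks. All data below is invented.
from statistics import mean

def accuracy(preds, labels):
    # Fraction of exact matches (classification).
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

def rmse(preds, targets):
    # Root-mean-square error (regression).
    return mean((p - t) ** 2 for p, t in zip(preds, targets)) ** 0.5

def score_task(task_type, preds, gold):
    return accuracy(preds, gold) if task_type == "classification" else rmse(preds, gold)

# Toy results echoing the reported pattern: classification looks solid while
# quantitative regression (here, hypothetical kcal/mol values) drifts badly.
cls = score_task("classification", ["toxic", "safe", "toxic"], ["toxic", "safe", "toxic"])
reg = score_task("regression", [-6.1, -7.9], [-8.3, -5.2])
```

A per-task metric table built this way is all that is needed to reproduce the paper's fourth finding at a glance: high accuracy on the classification rows, large errors on the regression rows.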

Core claim

BioMol-LLM-Bench evaluation of thirteen LLMs demonstrates systematic gaps between model outputs and mechanistic understanding of multi-scale bio-molecular systems, shown through limited or negative effects of chain-of-thought data, advantages of hybrid mamba-attention architectures on long sequences, specialization-generalization trade-offs after supervised fine-tuning, and reliable classification paired with persistent weakness on regression tasks.

What carries the argument

BioMol-LLM-Bench, the proposed cross-scale benchmark framework consisting of 26 downstream tasks at four difficulty levels with tool augmentation.

If this is right

  • Chain-of-thought data should be used sparingly or omitted for biological tasks to avoid performance losses.
  • Hybrid mamba-attention models merit priority when processing extended bio-molecular sequences.
  • Supervised fine-tuning requires safeguards to retain generalization across molecular scales.
  • Current LLMs suit classification work on bio-molecular properties but require further advances for accurate regression.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Architectures that embed explicit physical constraints could close the regression gap left by pure language modeling.
  • Expanding the benchmark with direct molecular-dynamics trajectories would test whether the observed limits hold under more mechanistic conditions.
  • Training mixtures that interleave experimental measurements with simulation data might reduce the specialization-generalization trade-off.

Load-bearing premise

The twenty-six chosen tasks sufficiently represent the mechanistic challenges of real multi-scale bio-molecular modeling.

What would settle it

An LLM that matches or exceeds baseline accuracy on held-out regression tasks such as quantitative prediction of binding free energies or reaction rates within the same benchmark setup would directly challenge the reported weakness.
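One way such a head-to-head could be scored is Spearman rank correlation between predicted and experimental binding free energies, for the candidate LLM and a baseline on the same held-out set. The helper below is a minimal stdlib sketch (no tie handling) and every number in it is invented.

```python
# Spearman rank correlation between predictions and experiment, used to
# compare a hypothetical LLM against a hypothetical baseline. Invented data.
def spearman(xs, ys):
    def ranks(vals):
        # Rank positions for distinct values (no tie correction).
        order = sorted(range(len(vals)), key=lambda i: vals[i])
        r = [0.0] * len(vals)
        for rank, i in enumerate(order):
            r[i] = float(rank)
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx) ** 0.5
    vy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (vx * vy)

experimental = [-9.1, -7.4, -8.2, -6.0, -10.3]   # kcal/mol, invented
llm_pred     = [-8.0, -7.9, -8.5, -6.2, -9.0]
baseline     = [-9.0, -6.9, -8.0, -6.5, -10.1]

llm_rho = spearman(llm_pred, experimental)
base_rho = spearman(baseline, experimental)
```

In this toy setup the baseline still ranks the compounds better than the LLM; an LLM whose `llm_rho` matched or exceeded `base_rho` on genuinely held-out tasks is the result that would challenge the paper's finding.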

read the original abstract

The modeling of bio-molecular system across molecular scales remains a central challenge in scientific research. Large language models (LLMs) are increasingly applied to bio-molecular discovery, yet systematic evaluation across multi-scale biological problems and rigorous assessment of their tool-augmented capabilities remain limited. We reveal a systematic gap between LLM performance and mechanistic understanding through the proposed cross-scale bio-molecular benchmark: BioMol-LLM-Bench, a unified framework comprising 26 downstream tasks that covers 4 distinct difficulty levels, and computational tools are integrated for a more comprehensive evaluation. Evaluation on 13 representative models reveals 4 main findings: chain-of-thought data provides limited benefit and may even reduce performance on biological tasks; hybrid mamba-attention architectures are more effective for long bio-molecular sequences; supervised fine-tuning improves specialization at the cost of generalization; and current LLMs perform well on classification tasks but remain weak on challenging regression tasks. Together, these findings provide practical guidance for future LLM-based modeling of molecular systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces BioMol-LLM-Bench, a unified benchmark with 26 downstream tasks spanning 4 difficulty levels for cross-scale bio-molecular modeling. It evaluates 13 representative LLMs and reports four findings: chain-of-thought data offers limited or negative benefit on biological tasks; hybrid mamba-attention architectures outperform on long sequences; supervised fine-tuning boosts specialization at the expense of generalization; and LLMs excel at classification but struggle with challenging regression tasks. The authors conclude this reveals a systematic gap between LLM performance and mechanistic understanding.

Significance. If the benchmark tasks genuinely probe multi-scale biophysical mechanisms rather than surface statistics, the empirical results across diverse models would provide actionable guidance for LLM architectures and training strategies in molecular biology and drug discovery. The broad model coverage is a positive aspect of the evaluation.

major comments (2)
  1. [Benchmark construction] The central claim of a 'systematic gap between LLM performance and mechanistic understanding' depends on the 26 tasks in BioMol-LLM-Bench requiring capture of physical cross-scale phenomena. The paper groups tasks into four difficulty levels but provides no explicit mapping demonstrating that higher levels enforce biophysical constraints such as energy conservation, force-field consistency, or long-range allostery (see abstract and benchmark construction description).
  2. [Abstract] The description of the benchmark and findings omits details on task selection criteria, the statistical tests used to support the four conclusions, and error bars on reported performance metrics, which limits assessment of the robustness of the observed gaps.
minor comments (1)
  1. [Results] The four findings are listed clearly in the abstract but would be strengthened by explicit quantitative comparisons (e.g., performance deltas) in the main text.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. The feedback highlights important opportunities to strengthen the connection between benchmark tasks and biophysical principles as well as to improve clarity in the abstract. We address each major comment below and will incorporate revisions accordingly.

read point-by-point responses
  1. Referee: [Benchmark construction] The central claim of a 'systematic gap between LLM performance and mechanistic understanding' depends on the 26 tasks in BioMol-LLM-Bench requiring capture of physical cross-scale phenomena. The paper groups tasks into four difficulty levels but provides no explicit mapping demonstrating that higher levels enforce biophysical constraints such as energy conservation, force-field consistency, or long-range allostery (see abstract and benchmark construction description).

    Authors: We appreciate this observation. The difficulty levels were designed to progressively incorporate tasks that demand modeling of cross-scale interactions (e.g., level 3–4 tasks include multi-domain proteins and allosteric effects), which in practice require capturing biophysical consistency beyond surface statistics. However, we acknowledge that an explicit mapping table linking each level to specific constraints such as energy conservation or force-field consistency was not included. We will add a dedicated subsection (and accompanying table) in the revised benchmark construction section that explicitly maps task levels to the biophysical principles they probe, with concrete examples drawn from the 26 tasks. This will directly support the central claim. revision: yes

  2. Referee: [Abstract] The description of the benchmark and findings omits details on task selection criteria, the statistical tests used to support the four conclusions, and error bars on reported performance metrics, which limits assessment of the robustness of the observed gaps.

    Authors: We agree that the abstract would benefit from greater specificity. In the revised manuscript we will expand the abstract to briefly note: (i) task selection criteria (coverage across molecular scales from sequence to structure-function with four graded difficulty levels), (ii) the statistical tests employed (paired t-tests and Wilcoxon rank-sum tests for model comparisons, with p-values reported in the main text), and (iii) that all performance metrics include error bars (standard deviation across three random seeds, shown in Figures 2–5). These details are already present in the methods and results sections; the abstract revision will make them visible at a glance without exceeding length limits. revision: yes
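The paired comparison the rebuttal describes can be sketched with the standard library alone: the paired t statistic over per-task score differences between two models, plus the seed-to-seed standard deviation used for error bars. All scores and seed means below are invented, not taken from the paper.

```python
# Sketch of the statistics named in the rebuttal (invented numbers):
# a paired t statistic over per-task differences, and the std across
# three random seeds that would back the error bars.
import math
import statistics

# Per-task scores for two models on the same 8 tasks (paired samples).
model_a = [0.81, 0.74, 0.69, 0.88, 0.77, 0.71, 0.83, 0.79]
model_b = [0.76, 0.70, 0.66, 0.84, 0.75, 0.65, 0.80, 0.74]

diffs = [a - b for a, b in zip(model_a, model_b)]
n = len(diffs)
# Paired t statistic: mean difference divided by its standard error.
t_stat = statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(n))

# Error bars: standard deviation of one model's mean score across 3 seeds.
seed_means = [0.781, 0.774, 0.790]
err = statistics.stdev(seed_means)
```

The t statistic still needs a p-value from the t distribution with n−1 degrees of freedom (e.g. via `scipy.stats.ttest_rel`, which also handles the Wilcoxon-style alternatives the rebuttal mentions); the sketch only shows where the numbers come from.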

Circularity Check

0 steps flagged

No circularity: purely empirical benchmark evaluation

full rationale

The paper introduces BioMol-LLM-Bench as a new collection of 26 tasks across difficulty levels and reports performance of 13 external LLMs on them. All four main findings are direct observations from these runs (e.g., CoT benefit, architecture comparisons, SFT effects, classification vs. regression gaps). No equations, fitted parameters, or predictions are defined in terms of the target results; the benchmark tasks and metrics are external to any model output. Self-citations, if present, are not load-bearing for any derivation. The evaluation is therefore self-contained against external models and tasks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims rest on the assumption that the chosen tasks represent key bio-molecular challenges; no free parameters or invented entities are introduced.

axioms (1)
  • domain assumption The 26 tasks in BioMol-LLM-Bench adequately represent multi-scale bio-molecular modeling problems
    Benchmark construction and all performance claims depend on this premise.

pith-pipeline@v0.9.0 · 5481 in / 1134 out tokens · 43849 ms · 2026-05-13T20:06:15.140863+00:00 · methodology


Reference graph

Works this paper leans on

72 extracted references · 72 canonical work pages · 8 internal anchors
