BioMatrix: Towards a Comprehensive Biological Foundation Model Spanning the Modality Matrix of Sequences, Structures, and Language

Chang-Yu Hsieh; Chengping Li; Conghui He; Han Guo; Liang He; Lijun Wu; Qizhi Pei; Rui Yan; Wei Li; Yi Duan

arxiv: 2606.22138 · v1 · pith:SQ4L2XPCnew · submitted 2026-06-20 · 💻 cs.CL · cs.AI · cs.LG · q-bio.BM

Qizhi Pei, Zhimeng Zhou, Yi Duan, Yiyang Zhao, Wei Li, Han Guo, Liang He, Chengping Li, Chang-Yu Hsieh, Conghui He, Rui Yan, Lijun Wu

Pith reviewed 2026-06-26 11:44 UTC · model grok-4.3

classification 💻 cs.CL cs.AI cs.LG q-bio.BM

keywords multimodal foundation model, biological sequences, molecular structures, protein structures, unified tokenization, next-token prediction, molecules and proteins, decoder-only architecture

The pith

A unified tokenization scheme maps molecular sequences and structures, protein sequences and structures, and natural language into one shared token space, so a single decoder-only model can consume and generate all of them under next-token prediction.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a foundation model that brings sequences, structures and text for both molecules and proteins into the same architecture. It does this by converting every input type into tokens from one vocabulary, allowing the model to read and write any combination through ordinary next-token prediction. No separate encoders, adapters or output heads are required for different modalities. If the approach holds, it would mean a generalist model can replace collections of specialized tools across understanding and generation tasks in biology. The work reports competitive or leading results on most of a large benchmark suite covering single-entity and cross-entity problems.

Core claim

BioMatrix is the first multimodal foundation model that natively integrates sequences, structures, and natural language for both molecules and proteins within a single decoder-only architecture. It achieves this by mapping molecular sequences (SMILES and SELFIES), molecular structures, protein sequences, protein structures, and natural language into a shared discrete token space through a unified tokenization scheme, so that all modalities are consumed and produced uniformly under a single next-token prediction objective without external encoders, projection adapters, or modality-specific output heads.

What carries the argument

The unified tokenization scheme that converts every modality into tokens from a single vocabulary, enabling uniform next-token prediction across sequences, structures and language.
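To make the idea of a single vocabulary concrete, here is a minimal sketch of how symbols from several modalities could share one id space. The modality names, symbol inventories, and helper functions (`build_shared_vocab`, `encode`) are hypothetical and chosen for illustration; the paper's actual vocabulary layout is not described in the text above.

```python
# Illustrative only: the vocabulary layout, modality names, and sizes below are
# assumptions for this sketch, not BioMatrix's actual tokenizer.
from typing import Dict, List, Tuple

Key = Tuple[str, str]  # (modality, symbol)


def build_shared_vocab(blocks: Dict[str, List[str]]) -> Dict[Key, int]:
    """Assign every (modality, symbol) pair a unique id in one id space."""
    vocab: Dict[Key, int] = {}
    for modality, symbols in blocks.items():
        for symbol in symbols:
            vocab[(modality, symbol)] = len(vocab)
    return vocab


def encode(stream: List[Key], vocab: Dict[Key, int]) -> List[int]:
    """Turn a mixed-modality stream into one flat list of token ids."""
    return [vocab[key] for key in stream]


# Toy symbol inventories; real ones would be far larger.
BLOCKS = {
    "text": ["<bos>", "<eos>", "binds", "to"],
    "selfies": ["[C]", "[O]", "[=C]"],
    "protein_seq": ["A", "C", "G"],
    "mol_struct": [f"<m{i}>" for i in range(4)],   # stand-ins for codebook entries
    "prot_struct": [f"<p{i}>" for i in range(4)],  # stand-ins for codebook entries
}
VOCAB = build_shared_vocab(BLOCKS)

# One sequence that interleaves text, a molecule, and a protein.
mixed = [
    ("text", "<bos>"), ("selfies", "[C]"), ("mol_struct", "<m2>"),
    ("text", "binds"), ("text", "to"),
    ("protein_seq", "A"), ("prot_struct", "<p1>"), ("text", "<eos>"),
]
ids = encode(mixed, VOCAB)
print(ids)  # [0, 4, 12, 2, 3, 7, 15, 1]
```

Because every id comes from the same table, a decoder trained on such sequences needs no per-modality input encoder or output head; whether a real scheme keeps enough structural detail is the separate question raised under the load-bearing premise.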

If this is right

The model reaches state-of-the-art or competitive results on 77 of 80 tasks spanning understanding and generation within and across modalities.
Both single-entity and multi-entity tasks, including molecule-protein and protein-protein interactions, become addressable inside one model.
Continual pretraining on 304.4 billion tokens that mix text, sequences, structures and cross-modal pairs produces the observed breadth of capability.
Specialized models are no longer required for most of the covered biological tasks once the unified token space is in place.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same tokenization logic could be tested on additional entity types such as metabolites or genes to check whether the generalist pattern extends further.
Deployment cost may drop because one set of weights replaces multiple modality-specific models and their associated adapters.
Zero-shot transfer between previously separate tasks, such as generating a protein structure directly from a textual description of its function, becomes a natural next measurement.
If the token space preserves enough geometric detail, downstream simulation or docking tools could operate directly on the model's generated tokens.

Load-bearing premise

A single discrete tokenization scheme can represent both sequence and three-dimensional structural information for molecules and proteins without meaningful loss or the need for modality-specific components.

What would settle it

Demonstration that the model cannot produce chemically valid molecular structures or biologically plausible protein folds when asked to generate from sequence tokens alone, or that removing the structure tokens from training collapses performance on structure-related tasks.

Figures

Figures reproduced from arXiv: 2606.22138 by Chang-Yu Hsieh, Chengping Li, Conghui He, Han Guo, Liang He, Lijun Wu, Qizhi Pei, Rui Yan, Wei Li, Yi Duan, Yiyang Zhao, Zhimeng Zhou.

**Figure 2.** 3D structure tokenization pipelines in BioMatrix. The molecule structure tokenizer converts each 3D conformer into SELFIES-aligned joint 1D–3D tokens through local spherical descriptors and a 512-entry vector-quantized codebook, while the protein structure tokenizer encodes backbone geometry into per-residue structure tokens using a GCPNet-based encoder and a 4096-entry codebook. Dashed arrows indicate the…

**Figure 3.** Overview of the continual pretraining corpus composition. Token budget distribution across…

**Figure 4.** Training loss curves of BioMatrix-1.7B and BioMatrix-4B during multimodal continual…

**Figure 5.** Overview of the molecular task suite evaluated in this section.

**Figure 6.** Overview of the protein task suite evaluated in this section.

**Figure 7.** Relative gain from scaling BioMatrix from 1.7B to 4B. The score compares only the two…
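The Figure 2 caption mentions a 512-entry vector-quantized codebook for molecular structure. The lookup step of vector quantization can be sketched as a nearest-neighbour search, shown below with a toy codebook. Everything here is an assumption for illustration: the codebook entries are corners of a 9-dimensional hypercube (chosen only because 2**9 = 512), and the descriptors are perturbed copies of codebook rows, not real spherical descriptors. The learned encoder and codebook from the paper are not reproduced.

```python
# Toy vector-quantization lookup. The codebook contents and descriptor width are
# assumptions for this sketch; only the size (512) comes from the Figure 2 caption.
from typing import List, Sequence

CODEBOOK_SIZE = 512
DIM = 9  # assumed width: 2**9 == 512, so each entry is one hypercube corner


def make_toy_codebook() -> List[List[float]]:
    """Entry i is the 9-bit binary expansion of i, as floats."""
    return [[float((i >> b) & 1) for b in range(DIM)] for i in range(CODEBOOK_SIZE)]


def nearest_code(vec: Sequence[float], codebook: Sequence[Sequence[float]]) -> int:
    """Index of the codebook entry with the smallest squared Euclidean distance."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(vec, codebook[i])),
    )


def quantize(descriptors: Sequence[Sequence[float]],
             codebook: Sequence[Sequence[float]]) -> List[int]:
    """Map each continuous descriptor to one discrete structure-token id."""
    return [nearest_code(d, codebook) for d in descriptors]


CODEBOOK = make_toy_codebook()
# Hypothetical descriptors: codebook rows 3, 41 and 500 shifted by 0.1 per dimension.
descriptors = [[x + 0.1 for x in CODEBOOK[i]] for i in (3, 41, 500)]
tokens = quantize(descriptors, CODEBOOK)
reconstructed = [CODEBOOK[t] for t in tokens]
print(tokens)  # [3, 41, 500]
```

Note that `reconstructed` differs from `descriptors`: the lookup discards the 0.1 offset. Quantization of this kind is lossy by construction, which is why the reconstruction quality of the real tokenizers matters for the paper's claims.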

Original abstract

We present BioMatrix, the first multimodal foundation model that natively integrates sequences, structures, and natural language for both molecules and proteins within a single decoder-only architecture. Existing biological foundation models pursue native multimodality and broad entity coverage separately: those that fuse multiple modalities under a shared objective remain confined to a single entity type, while those spanning multiple entity types either omit explicit structural modeling or rely on adapter-based designs in which the model cannot natively generate the very modalities it can read. BioMatrix closes this gap by mapping molecular sequences (supporting both SMILES and SELFIES notations), molecular structures, protein sequences, protein structures, and natural language into a shared discrete token space through a unified tokenization scheme, so that all modalities are consumed and produced uniformly under a single next-token prediction objective -- without external encoders, projection adapters, or modality-specific output heads. Built upon the Qwen3 language model (1.7B and 4B), BioMatrix is continually pretrained on 304.4 billion tokens spanning general and domain-specific text, sequence and structure views of molecules and proteins, and cross-modal corpora that interleave biomolecular entities with scientific text and link distinct entities through molecule-protein and protein-protein interaction data. After tuning on a comprehensive suite of downstream applications covering 80 tasks across 6 categories -- encompassing single-entity and multi-entity understanding and generation tasks across and within modalities -- BioMatrix achieves state-of-the-art or competitive performance on 77 out of 80 tasks, demonstrating that a single, natively multimodal generalist model can effectively match or surpass specialized approaches across a wide range of biological tasks.
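The "single next-token prediction objective" in the abstract is the standard autoregressive loss: the mean negative log-likelihood of each token given its prefix. The sketch below computes it for a toy sequence. The four-id vocabulary, its split into "text" and "structure" ids, and the probability rows are invented for illustration; the point is only that the loss treats every target id the same way regardless of modality.

```python
# Mean next-token negative log-likelihood on a toy example.
# The vocabulary split (ids 0-1 "text", ids 2-3 "structure") is an assumption.
import math
from typing import Sequence


def next_token_nll(probs: Sequence[Sequence[float]], targets: Sequence[int]) -> float:
    """Average of -log p(target_t | prefix) over all positions t."""
    if len(probs) != len(targets):
        raise ValueError("one probability row is needed per target token")
    return -sum(math.log(row[t]) for row, t in zip(probs, targets)) / len(targets)


# Each row is a hypothetical model distribution over the 4-id vocabulary.
probs = [
    [0.5, 0.25, 0.125, 0.125],   # position 0
    [0.25, 0.25, 0.25, 0.25],    # position 1
    [0.125, 0.125, 0.25, 0.5],   # position 2
]
targets = [0, 2, 3]  # one "text" id followed by two "structure" ids

loss = next_token_nll(probs, targets)
print(round(loss, 4))  # 0.9242, i.e. 4 * ln(2) / 3
```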

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

BioMatrix claims to be the first natively multimodal decoder-only model for sequences, structures, and language across molecules and proteins, but the abstract gives no tokenization details, so the central claim stays unverified.

The punchline is that this paper positions BioMatrix as closing a gap by putting sequences, 3D structures, and text for both molecules and proteins into one decoder-only model under a single next-token objective, without adapters or separate heads. It reports SOTA or competitive results on 77 of 80 tasks after pretraining on 304 billion tokens that mix general text, domain text, sequences, structures, and interaction data.

What stands out as new is the explicit combination: one model that covers both entity types and all three modalities natively, plus the use of cross-entity linking data during pretraining. The scale of the downstream suite (80 tasks across understanding and generation) is also broader than most prior single-entity or adapter-based efforts.

The soft spot is exactly the one the stress-test flags. The abstract describes a unified discrete tokenization scheme that maps 3D structures into the same vocabulary as SMILES, SELFIES, and text, but supplies no equations, pseudocode, or reconstruction metrics. Without that, it is impossible to tell whether generation stays purely autoregressive or whether the tokenizer embeds modality-specific logic that the model itself does not handle uniformly. The performance numbers also lack any mention of task definitions, baselines, splits, or significance tests, so the 77/80 figure cannot be assessed from what is given.

This is for groups already building or comparing biological foundation models who want to see whether a single generalist can replace several specialized ones. A reader focused on multimodal tokenization or bio generation tasks would find the scope useful even if the methods section needs close checking.

It deserves peer review because the ambition is real and the data volume is substantial; a referee can test whether the tokenization actually supports lossless structure generation and whether the evaluations hold up. I would send it out rather than desk reject.

Referee Report

2 major / 0 minor

Summary. The paper introduces BioMatrix, a decoder-only foundation model built on Qwen3 (1.7B/4B) that maps molecular sequences (SMILES/SELFIES), molecular structures, protein sequences, protein structures, and natural language into a shared discrete token space via a unified tokenization scheme. All modalities are handled uniformly under next-token prediction pretraining on 304.4B tokens (including cross-modal interaction data), followed by tuning on 80 tasks across 6 categories of single- and multi-entity understanding/generation; it reports SOTA or competitive results on 77/80 tasks.

Significance. If the unified tokenization supports lossless native generation of 3D structures under pure next-token prediction and the performance results are shown to be robust, this would be a notable contribution by demonstrating that a single generalist decoder-only model can span multiple biological entity types and modalities without adapters or modality-specific heads, potentially simplifying the landscape of biological foundation models.

major comments (2)

[Abstract] The central claim that BioMatrix 'achieves state-of-the-art or competitive performance on 77 out of 80 tasks' supplies no information on task definitions, baselines, data splits, statistical significance testing, or how structures are tokenized and generated; without these, the performance claim cannot be evaluated and is load-bearing for the paper's main result.
[Abstract, and the section describing the unified tokenization scheme] No equations, pseudocode, or algorithmic detail is given for discretizing 3D molecular/protein structures (coordinates or graphs) into the shared vocabulary alongside SMILES/SELFIES and text; this leaves open whether generation remains purely next-token prediction without information loss or implicit modality-specific logic, which directly underpins the 'natively integrates ... without external encoders, projection adapters, or modality-specific output heads' claim.
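The "information loss" concern in the second comment is usually quantified with a round-trip reconstruction metric: tokenize coordinates, decode them, and measure the deviation. The sketch below shows that measurement with a deliberately simple stand-in, uniform grid quantization, rather than the paper's learned tokenizers. The coordinates, grid steps, and function names are hypothetical. No superposition step is included, because snapping to a grid does not rotate or translate the frame; a learned, frame-invariant tokenizer would typically need alignment before RMSD.

```python
# Round-trip reconstruction error for a stand-in coordinate tokenizer.
# Uniform grid quantization is used here only as an assumption-laden proxy;
# it is not the tokenizer described in the paper.
import math
from typing import List, Tuple

Coord = Tuple[float, float, float]
Cell = Tuple[int, int, int]


def quantize_coords(coords: List[Coord], step: float) -> List[Cell]:
    """Snap each coordinate to the nearest grid cell of width `step`."""
    return [tuple(round(c / step) for c in xyz) for xyz in coords]


def dequantize(cells: List[Cell], step: float) -> List[Coord]:
    """Map grid cells back to coordinates at the cell centres."""
    return [tuple(i * step for i in cell) for cell in cells]


def rmsd(a: List[Coord], b: List[Coord]) -> float:
    """Root-mean-square deviation over atoms, without superposition."""
    sq = sum((p - q) ** 2 for u, v in zip(a, b) for p, q in zip(u, v))
    return math.sqrt(sq / len(a))


# Hypothetical three-atom fragment, coordinates in angstroms.
coords = [(0.00, 0.00, 0.00), (1.52, 0.03, -0.01), (2.10, 1.38, 0.20)]

coarse = rmsd(coords, dequantize(quantize_coords(coords, 0.5), 0.5))
fine = rmsd(coords, dequantize(quantize_coords(coords, 0.05), 0.05))
print(f"coarse grid RMSD: {coarse:.3f}  fine grid RMSD: {fine:.3f}")
```

For this stand-in the error is bounded by `sqrt(3) * step / 2`, so a finer grid trades a larger vocabulary for lower loss. Reporting the analogous trade-off for the actual codebooks would address the referee's point directly.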

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed feedback. We address each major comment below, indicating where revisions will be made to improve clarity and evaluability while preserving the manuscript's core contributions.

Referee: [Abstract] The central claim that BioMatrix 'achieves state-of-the-art or competitive performance on 77 out of 80 tasks' supplies no information on task definitions, baselines, data splits, statistical significance testing, or how structures are tokenized and generated; without these, the performance claim cannot be evaluated and is load-bearing for the paper's main result.

Authors: We agree the abstract is high-level by design. Full details on the 80 tasks (definitions, baselines, splits, and significance testing) appear in Section 4 and the supplementary material; structure tokenization/generation is covered in Section 3. To address evaluability concerns, we will revise the abstract to briefly reference the evaluation protocol and direct readers to the relevant sections for complete information.
revision: partial

Referee: [Abstract, and the section describing the unified tokenization scheme] No equations, pseudocode, or algorithmic detail is given for discretizing 3D molecular/protein structures (coordinates or graphs) into the shared vocabulary alongside SMILES/SELFIES and text; this leaves open whether generation remains purely next-token prediction without information loss or implicit modality-specific logic, which directly underpins the 'natively integrates ... without external encoders, projection adapters, or modality-specific output heads' claim.

Authors: We acknowledge the need for greater explicitness. While Section 3 describes the unified tokenization scheme that places all modalities (including 3D structures) into a shared discrete vocabulary for uniform next-token prediction, we will add equations and pseudocode for the 3D discretization process to the main text. This will explicitly confirm the absence of information loss or modality-specific logic.
revision: yes

Circularity Check

0 steps flagged

No circularity; empirical claims rest on downstream evaluation

The paper describes an empirical multimodal foundation model built on Qwen3 with a unified tokenization scheme for sequences, structures, and language. No equations, derivations, or parameter-fitting steps are presented that would reduce any claimed prediction or result to inputs by construction. Performance on 80 tasks is reported as external validation rather than a quantity defined by the pretraining objective. No self-citations, uniqueness theorems, or ansatzes are invoked in a load-bearing way within the provided text. The architecture claim is an engineering choice evaluated empirically, not a self-referential derivation.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the unproven premise that a unified discrete tokenization of structures is lossless enough for generation and that next-token prediction alone suffices for cross-modal alignment; no independent evidence for these modeling choices is supplied in the abstract.

free parameters (1)

Base model sizes: 1.7B and 4B parameter variants chosen from Qwen3; these are architectural decisions that affect capacity but are not derived from the biological data.

axioms (1)

Domain assumption: Next-token prediction on a shared discrete token space is sufficient to learn and generate across sequences, structures, and language without modality-specific components. Invoked in the description of the unified tokenization scheme and single-objective training.


discussion (0)

Reference graph

Works this paper leans on

155 extracted references · 1 canonical work pages

[1]

Uniprot: the universal protein knowledgebase in 2023.Nucleic acids research, 51(D1):D523–D531, 2023

2023
[2]

Open-AlphaSeq: Open protein–protein interaction affinity datasets, 2025

A-Alpha Bio. Open-AlphaSeq: Open protein–protein interaction affinity datasets, 2025. URL https: //huggingface.co/datasets/aalphabio/open-alphaseq

2025
[3]

Prot2text: Multi- modal protein’s function generation with gnns and transformers

Hadi Abdine, Michail Chatzianastasis, Costas Bouyioukos, and Michalis Vazirgiannis. Prot2text: Multi- modal protein’s function generation with gnns and transformers. InProceedings of the AAAI conference on artificial intelligence, volume 38, pages 10757–10765, 2024

2024
[4]

Accurate structure prediction of biomolecular interactions with alphafold 3.Nature, 630(8016):493–500, 2024

Josh Abramson, Jonas Adler, Jack Dunger, Richard Evans, Tim Green, Alexander Pritzel, Olaf Ronneberger, Lindsay Willmore, Andrew J Ballard, Joshua Bambrick, et al. Accurate structure prediction of biomolecular interactions with alphafold 3.Nature, 630(8016):493–500, 2024

2024
[5]

Gpt-4 technical report.arXiv preprint arXiv:2303.08774, 2023

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report.arXiv preprint arXiv:2303.08774, 2023

Pith/arXiv arXiv 2023
[6]

gpt-oss-120b & gpt-oss-20b model card.arXiv preprint arXiv:2508.10925, 2025

Sandhini Agarwal, Lama Ahmad, Jason Ai, Sam Altman, Andy Applebaum, Edwin Arbus, Rahul K Arora, Yu Bai, Bowen Baker, Haiming Bao, et al. gpt-oss-120b & gpt-oss-20b model card.arXiv preprint arXiv:2508.10925, 2025

Pith/arXiv arXiv 2025
[7]

Protein generation with evolutionary diffusion: sequence is all you need

Sarah Alamdari, Nitya Thakkar, Rianne van den Berg, Alex Lu, Nicolo Fusi, Ava Amini, and Kevin Yang. Protein generation with evolutionary diffusion: sequence is all you need. InNeurIPS 2023 Generative AI and Biology (GenBio) Workshop

2023
[8]

Claude 3.5 Sonnet model card addendum, 2024

Anthropic. Claude 3.5 Sonnet model card addendum, 2024. URL https://www-cdn.anthropic.com/ fed9cc193a14b84131812372d8d5857f8f304c52/Model Card Claude 3 Addendum.pdf

2024
[9]

Claude Opus 4.6 system card

Anthropic. Claude Opus 4.6 system card. Technical report, Anthropic, February 2026. URL https: //www.anthropic.com/claude-opus-4-6-system-card

2026
[10]

The claude 3 model family: Opus, sonnet, haiku.Claude-3 Model Card, 1(1):4, 2024

AI Anthropic. The claude 3 model family: Opus, sonnet, haiku.Claude-3 Model Card, 1(1):4, 2024

2024
[11]

Viraj Bagal, Rishal Aggarwal, P . K. Vinod, and U. Deva Priyakumar. Molgpt: Molecular generation using a transformer-decoder model.J. Chem. Inf. Model., 62(9):2064–2076, 2022

2064
[12]

Equivariant energy-guided SDE for inverse molecular design

Fan Bao, Min Zhao, Zhongkai Hao, Peiyao Li, Chongxuan Li, and Jun Zhu. Equivariant energy-guided SDE for inverse molecular design. InICLR. OpenReview.net, 2023

2023
[13]

The protein data bank.Nucleic acids research, 28(1):235–242, 2000

Helen M Berman, John Westbrook, Zukang Feng, Gary Gilliland, Talapady N Bhat, Helge Weissig, Ilya N Shindyalov, and Philip E Bourne. The protein data bank.Nucleic acids research, 28(1):235–242, 2000. 53 BioMatrix

2000
[14]

Alphafold protein structure database 2025: a redesigned interface and updated structural coverage.Nucleic Acids Research, 54 (D1):D358–D362, 2026

Damian Bertoni, Maxim Tsenkov, Paulyna Magana, Sreenath Nair, Ivanna Pidruchna, Marcelo Querino Lima Afonso, Adam Midlik, Urmila Paramval, Dare Lawal, Ahsan Tanweer, et al. Alphafold protein structure database 2025: a redesigned interface and updated structural coverage.Nucleic Acids Research, 54 (D1):D358–D362, 2026

2025
[15]

Bronstein, and Alexander Tong

Avishek Joey Bose, Tara Akhound-Sadegh, Guillaume Huguet, Kilian Fatras, Jarrid Rector-Brooks, Cheng- Hao Liu, Andrei Cristian Nica, Maksym Korablyov, Michael M. Bronstein, and Alexander Tong. Se(3)- stochastic flow matching for protein backbone generation. InICLR. OpenReview.net, 2024

2024
[16]

Nathan Brown, Marco Fiscato, Marwin H. S. Segler, and Alain C. Vaucher. Guacamol: Benchmarking models for de novo molecular design.J. Chem. Inf. Model., 59(3):1096–1108, 2019

2019
[17]

Learning to design protein- protein interactions with enhanced generalization

Anton Bushuiev, Roman Bushuiev, Petr Kouba, Anatolii Filkin, Marketa Gabrielova, Michal Gabriel, Jir´ı Sedl´ar, Tom´as Pluskal, Jir´ı Damborsk´y, Stanislav Mazurenko, and Josef Sivic. Learning to design protein- protein interactions with enhanced generalization. InICLR. OpenReview.net, 2024

2024
[18]

Jaakkola

Andrew Campbell, Jason Yim, Regina Barzilay, Tom Rainforth, and Tommi S. Jaakkola. Generative flows on discrete state-spaces: Enabling multimodal flows with applications to protein co-design. InICML, Proceedings of Machine Learning Research, pages 5453–5512. PMLR / OpenReview.net, 2024

2024
[19]

PRESTO: progressive pretraining enhances synthetic chemistry outcomes

He Cao, Yanjun Shao, Zhiyuan Liu, Zijing Liu, Xiangru Tang, Yuan Yao, and Yu Li. PRESTO: progressive pretraining enhances synthetic chemistry outcomes. InEMNLP (Findings), Findings of ACL, pages 10197– 10224. Association for Computational Linguistics, 2024

2024
[20]

Lifan Chen, Xiaoqin Tan, Dingyan Wang, Feisheng Zhong, Xiaohong Liu, Tianbiao Yang, Xiaomin Luo, Kaixian Chen, Hualiang Jiang, Mingyue Zheng, and Arne Elofsson. Transformercpi: improving compound- protein interaction prediction by sequence-based deep learning with self-attention mechanism and label reversal experiments.Bioinform., 36(16):4406–4414, 2020

2020
[21]

Janus-pro: Unified multimodal understanding and generation with data and model scaling.arXiv preprint arXiv:2501.17811, 2025

Xiaokang Chen, Zhiyu Wu, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, and Chong Ruan. Janus-pro: Unified multimodal understanding and generation with data and model scaling.arXiv preprint arXiv:2501.17811, 2025

Pith/arXiv arXiv 2025
[22]

Toward de novo protein design from natural language.BioRxiv, pages 2024–08, 2024

Fengyuan Dai, Shiyang You, Yudian Zhu, Yuan Gao, Lihao Fu, Xibin Zhou, Jin Su, Chentong Wang, Yuliang Fan, Xiaoxiao Ma, et al. Toward de novo protein design from natural language.BioRxiv, pages 2024–08, 2024

2024
[23]

Robust deep learning–based protein sequence design using proteinmpnn.Science, 378(6615):49–56, 2022

Justas Dauparas, Ivan Anishchenko, Nathaniel Bennett, Hua Bai, Robert J Ragotte, Lukas F Milles, Basile IM Wicky, Alexis Courbet, Rob J de Haas, Neville Bethel, et al. Robust deep learning–based protein sequence design using proteinmpnn.Science, 378(6615):49–56, 2022

2022
[24]

Translation between molecules and natural language

Carl Edwards, Tuan Manh Lai, Kevin Ros, Garrett Honke, Kyunghyun Cho, and Heng Ji. Translation between molecules and natural language. InEMNLP, pages 375–413. Association for Computational Linguistics, 2022

2022
[25]

Prottrans: toward understanding the language of life through self-supervised learning.IEEE transactions on pattern analysis and machine intelligence, 44(10): 7112–7127, 2021

Ahmed Elnaggar, Michael Heinzinger, Christian Dallago, Ghalia Rehawi, Yu Wang, Llion Jones, Tom Gibbs, Tamas Feher, Christoph Angerer, Martin Steinegger, et al. Prottrans: toward understanding the language of life through self-supervised learning.IEEE transactions on pattern analysis and machine intelligence, 44(10): 7112–7127, 2021

2021
[26]

Interleaved tool-call reasoning for protein function understanding, 2026

Chuanliu Fan, Zicheng Ma, Huanran Meng, Aijia Zhang, Wenjie Du, Jun Zhang, Yi Qin Gao, Ziqiang Cao, and Guohong Fu. Interleaved tool-call reasoning for protein function understanding, 2026. URL https://arxiv.org/abs/2601.03604

arXiv 2026
[27]

Megascience: Pushing the frontiers of post-training datasets for science reasoning.arXiv preprint arXiv:2507.16812, 2025

Run-Ze Fan, Zengzhi Wang, and Pengfei Liu. Megascience: Pushing the frontiers of post-training datasets for science reasoning.arXiv preprint arXiv:2507.16812, 2025

arXiv 2025
[28]

Mol-instructions: A large-scale biomolecular instruction dataset for large language models

Yin Fang, Xiaozhuan Liang, Ningyu Zhang, Kangwei Liu, Rui Huang, Zhuo Chen, Xiaohui Fan, and Huajun Chen. Mol-instructions: A large-scale biomolecular instruction dataset for large language models. InICLR. OpenReview.net, 2024

2024
[29]

Domain-agnostic molecular generation with chemical feedback

Yin Fang, Ningyu Zhang, Zhuo Chen, Lingbing Guo, Xiaohui Fan, and Huajun Chen. Domain-agnostic molecular generation with chemical feedback. InICLR. OpenReview.net, 2024. 54 BioMatrix

2024
[30]

Prediction of membrane protein types based on the hydrophobic index of amino acids.Journal of protein chemistry, 19(4):269–275, 2000

Zhi-Ping Feng and Chun-Ting Zhang. Prediction of membrane protein types based on the hydrophobic index of amino acids.Journal of protein chemistry, 19(4):269–275, 2000

2000
[31]

Francoeur, Tomohide Masuda, Jocelyn Sunseri, Andrew Jia, Richard B

Paul G. Francoeur, Tomohide Masuda, Jocelyn Sunseri, Andrew Jia, Richard B. Iovanisci, Ian Snyder, and David Ryan Koes. Three-dimensional convolutional neural networks and a cross-docked data set for structure-based drug design.J. Chem. Inf. Model., 60(9):4200–4215, 2020

2020
[32]

Tokenizing 3d molecule structure with quantized spherical coordinates

Kaiyuan Gao, Yusong Wang, Haoxiang Guan, Zun Wang, Qizhi Pei, John Hopcroft, Kun He, and Lijun Wu. Tokenizing 3d molecule structure with quantized spherical coordinates. InProceedings of the 32nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining V . 1, pages 291–301, 2026

2026
[33]

Niklas W. A. Gebauer, Michael Gastegger, and Kristof Sch¨utt. Symmetry-adapted generation of 3d point sets for the targeted discovery of molecules. InNeurIPS, pages 7564–7576, 2019

2019
[34]

Binding affinity training data set, 2021

J Glaser. Binding affinity training data set, 2021. URLhttps://huggingface.co/datasets/jglaser/binding affinity

2021
[35]

Gemini 2.5: Our most intelligent AI model

Google DeepMind. Gemini 2.5: Our most intelligent AI model. Google DeepMind Blog, March 2025. URL https://blog.google/technology/google-deepmind/gemini-model-thinking-updates-march-2025/ . Ac- cessed: 2025-08-12

2025
[36]

The llama 3 herd of models.arXiv preprint arXiv:2407.21783, 2024

Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al- Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models.arXiv preprint arXiv:2407.21783, 2024

Pith/arXiv arXiv 2024
[37]

3d equivariant diffusion for target-aware molecule generation and affinity prediction

Jiaqi Guan, Wesley Wei Qian, Xingang Peng, Yufeng Su, Jian Peng, and Jianzhu Ma. 3d equivariant diffusion for target-aware molecule generation and affinity prediction. InICLR. OpenReview.net, 2023

2023
[38]

Objective-reinforced generative adversarial networks (organ) for sequence generation models.arXiv preprint arXiv:1705.10843, 2017

Gabriel Lima Guimaraes, Benjamin Sanchez-Lengeling, Carlos Outeiral, Pedro Luis Cunha Farias, and Al´an Aspuru-Guzik. Objective-reinforced generative adversarial networks (organ) for sequence generation models.arXiv preprint arXiv:1705.10843, 2017

Pith/arXiv arXiv 2017
[39]

Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025

Pith/arXiv arXiv 2025
[40]

Using support vector machine combined with auto covariance to predict protein–protein interactions from protein sequences.Nucleic acids research, 36(9): 3025–3030, 2008

Yanzhi Guo, Lezheng Yu, Zhining Wen, and Menglong Li. Using support vector machine combined with auto covariance to predict protein–protein interactions from protein sequences.Nucleic acids research, 36(9): 3025–3030, 2008

2008
[41]

J ¨urgen Haas, Alessandro Barbato, Dario Behringer, Gabriel Studer, Steven Roth, Martino Bertoni, Khaled Mostaguir, Rafal Gumienny, and Torsten Schwede. Continuous automated model evaluation (cameo) complementing the critical assessment of structure prediction in casp12.Proteins: Structure, Function, and Bioinformatics, 86:387–398, 2018

2018
[42]

Simulating 500 million years of evolution with a language model.Science, 387(6736):850–858, 2025

Thomas Hayes, Roshan Rao, Halil Akin, Nicholas J Sofroniew, Deniz Oktay, Zeming Lin, Robert Verkuil, Vincent Q Tran, Jonathan Deaton, Marius Wiggert, et al. Simulating 500 million years of evolution with a language model.Science, 387(6736):850–858, 2025

2025
[43]

Measuring massive multitask language understanding

Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. InICLR. OpenReview.net, 2021

2021
[44]

Equivariant diffusion for molecule generation in 3d

Emiel Hoogeboom, Victor Garcia Satorras, Cl´ement Vignac, and Max Welling. Equivariant diffusion for molecule generation in 3d. InICML, Proceedings of Machine Learning Research, pages 8867–8887. PMLR, 2022

2022
[45]

OGB-LSC: A large-scale challenge for machine learning on graphs

Weihua Hu, Matthias Fey, Hongyu Ren, Maho Nakata, Yuxiao Dong, and Jure Leskovec. OGB-LSC: A large-scale challenge for machine learning on graphs. InNeurIPS Datasets and Benchmarks, 2021

2021
[46]

Conditional diffusion based on discrete graph structures for molecular graph generation

Han Huang, Leilei Sun, Bowen Du, and Weifeng Lv. Conditional diffusion based on discrete graph structures for molecular graph generation. InAAAI, pages 4302–4311. AAAI Press, 2023

2023
[47]

Learning joint 2-d and 3-d graph diffusion models for complete molecule generation.IEEE Trans

Han Huang, Leilei Sun, Bowen Du, and Weifeng Lv. Learning joint 2-d and 3-d graph diffusion models for complete molecule generation.IEEE Trans. Neural Networks Learn. Syst., 35(9):11857–11871, 2024. 55 BioMatrix

2024
[48]

MDM: molecular diffusion model for 3d molecule generation

Lei Huang, Hengtong Zhang, Tingyang Xu, and Ka-Chun Wong. MDM: molecular diffusion model for 3d molecule generation. InAAAI, pages 5105–5112. AAAI Press, 2023

2023
[49]

Qwen2.5-coder technical report.CoRR, abs/2409.12186, 2024

Binyuan Hui, Jian Yang, Zeyu Cui, Jiaxi Yang, Dayiheng Liu, Lei Zhang, Tianyu Liu, Jiajun Zhang, Bowen Yu, Kai Dang, An Yang, Rui Men, Fei Huang, Xingzhang Ren, Xuancheng Ren, Jingren Zhou, and Junyang Lin. Qwen2.5-coder technical report.CoRR, abs/2409.12186, 2024

Pith/arXiv arXiv 2024
[50]

Gpt-4o system card.arXiv preprint arXiv:2410.21276, 2024

Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card.arXiv preprint arXiv:2410.21276, 2024

Pith/arXiv arXiv 2024
[51]

John B Ingraham, Max Baranov, Zak Costello, Karl W Barber, Wujie Wang, Ahmed Ismail, Vincent Frappier, Dana M Lord, Christopher Ng-Thow-Hing, Erik R Van Vlack, et al. Illuminating protein space with a programmable generative model. Nature, 623(7989):1070–1078, 2023

[52]

Dejun Jiang, Chang-Yu Hsieh, Zhenxing Wu, Yu Kang, Jike Wang, Ercheng Wang, Ben Liao, Chao Shen, Lei Xu, Jian Wu, et al. Interactiongraphnet: A novel and efficient deep graph representation learning framework for accurate protein–ligand interaction predictions. Journal of medicinal chemistry, 64(24):18209–18232, 2021

[53]

Wengong Jin, Regina Barzilay, and Tommi S. Jaakkola. Junction tree variational autoencoder for molecular graph generation. In ICML, Proceedings of Machine Learning Research, pages 2328–2337. PMLR, 2018

[54]

Sunghwan Kim, Jie Chen, Tiejun Cheng, Asta Gindulyte, Jia He, Siqian He, Qingliang Li, Benjamin A Shoemaker, Paul A Thiessen, Bo Yu, et al. Pubchem 2025 update. Nucleic acids research, 53(D1):D1516–D1525, 2025

[55]

Diederik P. Kingma and Max Welling. Auto-encoding variational bayes. In ICLR, 2014

[56]

Mario Krenn, Florian Häse, AkshatKumar Nigam, Pascal Friederich, and Alán Aspuru-Guzik. Self-referencing embedded strings (SELFIES): A 100% robust molecular string representation. Mach. Learn. Sci. Technol., 1(4):45024, 2020

[57]

Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. In SOSP, pages 611–626. ACM, 2023

[58]

Youngchun Kwon, Dongseon Lee, Youn-Suk Choi, Kyoham Shin, and Seokho Kang. Compressed graph representation for scalable molecular graph generation. J. Cheminformatics, 12(1):58, 2020

[59]

Tuan Le, Julian Cremer, Frank Noé, Djork-Arné Clevert, and Kristof T. Schütt. Navigating the design space of equivariant diffusion-based generative models for de novo 3d molecule generation. In ICLR. OpenReview.net, 2024

[60]

Jiatong Li, Junxian Li, Weida Wang, Yunqing Liu, Changmeng Zheng, Dongzhan Zhou, Xiao-yong Wei, and Qing Li. Speak-to-structure: Evaluating llms in open-domain natural language-driven molecule generation. arXiv preprint arXiv:2412.14642, 2024

[61]

Mingyang Li, Yurou Liu, Jieping Ye, Bing Su, Ji-Rong Wen, and Zheng Wang. Speaking the language of science: Toward a general-purpose generative foundation model for the natural sciences. arXiv preprint arXiv:2606.16905, 2026

[62]

Shuya Li, Fangping Wan, Hantao Shu, Tao Jiang, Dan Zhao, and Jianyang Zeng. Monn: a multi-objective neural network for predicting compound-protein interactions and affinities. Cell systems, 10(4):308–322, 2020

[63]

Sihang Li, Zhiyuan Liu, Yanchen Luo, Xiang Wang, Xiangnan He, Kenji Kawaguchi, Tat-Seng Chua, and Qi Tian. Towards 3d molecule-text interpretation in language models. In ICLR. OpenReview.net, 2024

[64]

Zhirui Liao, Lei Xie, Hiroshi Mamitsuka, and Shanfeng Zhu. Sc2mol: a scaffold-based two-step molecule generator with variational autoencoder and transformer. Bioinform., 39(1), 2023

[65]

Zeming Lin, Halil Akin, Roshan Rao, Brian Hie, Zhongkai Zhu, Wenting Lu, Nikita Smetanin, Robert Verkuil, Ori Kabeli, Yaniv Shmueli, et al. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science, 379(6637):1123–1130, 2023

[66]

Nuowei Liu, Jiahao Kuang, Yanting Liu, Tao Ji, Changzhi Sun, Man Lan, and Yuanbin Wu. Protein design with dynamic protein vocabulary. In The Thirty-ninth Annual Conference on Neural Information Processing Systems

[67]

Shengchao Liu, Yutao Zhu, Jiarui Lu, Zhao Xu, Weili Nie, Anthony Gitter, Chaowei Xiao, Jian Tang, Hongyu Guo, and Anima Anandkumar. A text-guided protein design framework. arXiv preprint arXiv:2302.04611, 2023

[68]

Tiqing Liu, Yuhmei Lin, Xin Wen, Robert N. Jorissen, and Michael K. Gilson. Bindingdb: a web-accessible database of experimentally determined protein-ligand binding affinities. Nucleic Acids Res., 35(Database-Issue):198–201, 2007

[69]

Tiqing Liu, Linda Hwang, Stephen K Burley, Carmen I Nitsche, Christopher Southan, W Patrick Walters, and Michael K Gilson. Bindingdb in 2024: a fair knowledgebase of protein-small molecule binding data. Nucleic acids research, 53(D1):D1633–D1644, 2025

[70]

Zhihai Liu, Minyi Su, Li Han, Jie Liu, Qifan Yang, Yan Li, and Renxiao Wang. Forging the basis for developing protein–ligand interaction scoring functions. Accounts of chemical research, 50(2):302–309, 2017

[71]

Zhiyuan Liu, Sihang Li, Yanchen Luo, Hao Fei, Yixin Cao, Kenji Kawaguchi, Xiang Wang, and Tat-Seng Chua. Molca: Molecular graph-language modeling with cross-modal projector and uni-modal adapter. In EMNLP, pages 15623–15638. Association for Computational Linguistics, 2023

[72]

Zhiyuan Liu, An Zhang, Hao Fei, Enzhi Zhang, Xiang Wang, Kenji Kawaguchi, and Tat-Seng Chua. Prott3: Protein-to-text generation for text-based protein understanding. In ACL (1), pages 5949–5966. Association for Computational Linguistics, 2024

[73]

Zhiyuan Liu, Yanchen Luo, Han Huang, Enzhi Zhang, Sihang Li, Junfeng Fang, Yaorui Shi, Xiang Wang, Kenji Kawaguchi, and Tat-Seng Chua. Next-mol: 3d diffusion meets 1d language modeling for 3d molecule generation. In ICLR. OpenReview.net, 2025

[74]

Kyle Lo, Lucy Lu Wang, Mark Neumann, Rodney Kinney, and Daniel S. Weld. S2ORC: the semantic scholar open research corpus. In ACL, pages 4969–4983. Association for Computational Linguistics, 2020

[75]

Daniel Lowe. Chemical reactions from us patents, 2017

[76]

Anton Lozhkov, Loubna Ben Allal, Leandro von Werra, and Thomas Wolf. Fineweb-edu: the finest collection of educational content, 2024. URL https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu

[77]

Wei Lu, Qifeng Wu, Jixian Zhang, Jiahua Rao, Chengtao Li, and Shuangjia Zheng. Tankbind: Trigonometry-aware neural networks for drug-protein binding structure prediction. In NeurIPS, 2022

[78]

Xingyu Lu, He Cao, Zijing Liu, Shengyuan Bai, Leqing Chen, Yuan Yao, Hai-Tao Zheng, and Yu Li. Moleculeqa: A dataset to evaluate factual accuracy in molecular comprehension. In EMNLP (Findings), Findings of ACL, pages 3769–3789. Association for Computational Linguistics, 2024

[79]

Yizhen Luo, Jiahuan Zhang, Siqi Fan, Kai Yang, Yushuai Wu, Mu Qiao, and Zaiqing Nie. Biomedgpt: Open multimodal generative pre-trained transformer for biomedicine. arXiv preprint arXiv:2308.09442, 2023

[80]

Youzhi Luo and Shuiwang Ji. An autoregressive flow model for 3d molecular geometry generation from scratch. In ICLR. OpenReview.net, 2022
