pith. sign in

arxiv: 2606.22138 · v1 · pith:SQ4L2XPCnew · submitted 2026-06-20 · 💻 cs.CL · cs.AI· cs.LG· q-bio.BM

BioMatrix: Towards a Comprehensive Biological Foundation Model Spanning the Modality Matrix of Sequences, Structures, and Language

Pith reviewed 2026-06-26 11:44 UTC · model grok-4.3

classification 💻 cs.CL cs.AIcs.LGq-bio.BM
keywords multimodal foundation modelbiological sequencesmolecular structuresprotein structuresunified tokenizationnext-token predictionmolecules and proteinsdecoder-only architecture
0
0 comments X

The pith

A unified tokenization scheme maps molecular sequences, structures, protein sequences, structures and natural language into one shared space so a single decoder-only model can consume and generate all of them under next-token prediction.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a foundation model that brings sequences, structures and text for both molecules and proteins into the same architecture. It does this by converting every input type into tokens from one vocabulary, allowing the model to read and write any combination through ordinary next-token prediction. No separate encoders, adapters or output heads are required for different modalities. If the approach holds, it would mean a generalist model can replace collections of specialized tools across understanding and generation tasks in biology. The work reports competitive or leading results on most of a large benchmark suite covering single-entity and cross-entity problems.

Core claim

BioMatrix is the first multimodal foundation model that natively integrates sequences, structures, and natural language for both molecules and proteins within a single decoder-only architecture. It achieves this by mapping molecular sequences (SMILES and SELFIES), molecular structures, protein sequences, protein structures, and natural language into a shared discrete token space through a unified tokenization scheme, so that all modalities are consumed and produced uniformly under a single next-token prediction objective without external encoders, projection adapters, or modality-specific output heads.

What carries the argument

The unified tokenization scheme that converts every modality into tokens from a single vocabulary, enabling uniform next-token prediction across sequences, structures and language.

If this is right

  • The model reaches state-of-the-art or competitive results on 77 of 80 tasks spanning understanding and generation within and across modalities.
  • Both single-entity and multi-entity tasks, including molecule-protein and protein-protein interactions, become addressable inside one model.
  • Continual pretraining on hundreds of billions of tokens that mix text, sequences, structures and cross-modal pairs produces the observed breadth of capability.
  • Specialized models are no longer required for most of the covered biological tasks once the unified token space is in place.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same tokenization logic could be tested on additional entity types such as metabolites or genes to check whether the generalist pattern extends further.
  • Deployment cost may drop because one set of weights replaces multiple modality-specific models and their associated adapters.
  • Zero-shot transfer between previously separate tasks, such as generating a protein structure directly from a textual description of its function, becomes a natural next measurement.
  • If the token space preserves enough geometric detail, downstream simulation or docking tools could operate directly on the model's generated tokens.

Load-bearing premise

A single discrete tokenization scheme can represent both sequence and three-dimensional structural information for molecules and proteins without meaningful loss or the need for modality-specific components.

What would settle it

Demonstration that the model cannot produce chemically valid molecular structures or biologically plausible protein folds when asked to generate from sequence tokens alone, or that removing the structure tokens from training collapses performance on structure-related tasks.

Figures

Figures reproduced from arXiv: 2606.22138 by Chang-Yu Hsieh, Chengping Li, Conghui He, Han Guo, Liang He, Lijun Wu, Qizhi Pei, Rui Yan, Wei Li, Yi Duan, Yiyang Zhao, Zhimeng Zhou.

Figure 1
Figure 1. Figure 1: Overview of BioMatrix. Natural language, molecular line notations (SMILES, SELFIES), [PITH_FULL_IMAGE:figures/full_fig_p005_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: 3D structure tokenization pipelines in BioMatrix. The molecule structure tokenizer converts each 3D conformer into SELFIES-aligned joint 1D–3D tokens through local spherical descriptors and a 512-entry vector-quantized codebook, while the protein structure tokenizer encodes backbone geometry into per-residue structure tokens using a GCPNet-based encoder and a 4096-entry codebook. Dashed arrows indicate the… view at source ↗
Figure 3
Figure 3. Figure 3: Overview of the continual pretraining corpus composition. Token budget distribution across [PITH_FULL_IMAGE:figures/full_fig_p012_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Training loss curves of BioMatrix-1.7B and BioMatrix-4B during multimodal continual [PITH_FULL_IMAGE:figures/full_fig_p015_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Overview of the molecular task suite evaluated in this section. [PITH_FULL_IMAGE:figures/full_fig_p019_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Overview of the protein task suite evaluated in this section. [PITH_FULL_IMAGE:figures/full_fig_p033_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Relative gain from scaling BioMatrix from 1.7B to 4B. The score compares only the two [PITH_FULL_IMAGE:figures/full_fig_p050_7.png] view at source ↗
read the original abstract

We present BioMatrix, the first multimodal foundation model that natively integrates sequences, structures, and natural language for both molecules and proteins within a single decoder-only architecture. Existing biological foundation models pursue native multimodality and broad entity coverage separately: those that fuse multiple modalities under a shared objective remain confined to a single entity type, while those spanning multiple entity types either omit explicit structural modeling or rely on adapter-based designs in which the model cannot natively generate the very modalities it can read. BioMatrix closes this gap by mapping molecular sequences (supporting both SMILES and SELFIES notations), molecular structures, protein sequences, protein structures, and natural language into a shared discrete token space through a unified tokenization scheme, so that all modalities are consumed and produced uniformly under a single next-token prediction objective -- without external encoders, projection adapters, or modality-specific output heads. Built upon the Qwen3 language model (1.7B and 4B), BioMatrix is continually pretrained on 304.4 billion tokens spanning general and domain-specific text, sequence and structure views of molecules and proteins, and cross-modal corpora that interleave biomolecular entities with scientific text and link distinct entities through molecule-protein and protein-protein interaction data. After tuning on a comprehensive suite of downstream applications covering 80 tasks across 6 categories -- encompassing single-entity and multi-entity understanding and generation tasks across and within modalities -- BioMatrix achieves state-of-the-art or competitive performance on 77 out of 80 tasks, demonstrating that a single, natively multimodal generalist model can effectively match or surpass specialized approaches across a wide range of biological tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper introduces BioMatrix, a decoder-only foundation model built on Qwen3 (1.7B/4B) that maps molecular sequences (SMILES/SELFIES), molecular structures, protein sequences, protein structures, and natural language into a shared discrete token space via a unified tokenization scheme. All modalities are handled uniformly under next-token prediction pretraining on 304.4B tokens (including cross-modal interaction data), followed by tuning on 80 tasks across 6 categories of single- and multi-entity understanding/generation; it reports SOTA or competitive results on 77/80 tasks.

Significance. If the unified tokenization supports lossless native generation of 3D structures under pure next-token prediction and the performance results are shown to be robust, this would be a notable contribution by demonstrating that a single generalist decoder-only model can span multiple biological entity types and modalities without adapters or modality-specific heads, potentially simplifying the landscape of biological foundation models.

major comments (2)
  1. [Abstract] Abstract: the central claim that BioMatrix 'achieves state-of-the-art or competitive performance on 77 out of 80 tasks' supplies no information on task definitions, baselines, data splits, statistical significance testing, or how structures are tokenized and generated; without these, the performance claim cannot be evaluated and is load-bearing for the paper's main result.
  2. [Abstract] Abstract (and the section describing the unified tokenization scheme): no equations, pseudocode, or algorithmic detail is given for discretizing 3D molecular/protein structures (coordinates or graphs) into the shared vocabulary alongside SMILES/SELFIES and text; this leaves open whether generation remains purely next-token prediction without information loss or implicit modality-specific logic, which directly underpins the 'natively integrates ... without external encoders, projection adapters, or modality-specific output heads' claim.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed feedback. We address each major comment below, indicating where revisions will be made to improve clarity and evaluability while preserving the manuscript's core contributions.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the central claim that BioMatrix 'achieves state-of-the-art or competitive performance on 77 out of 80 tasks' supplies no information on task definitions, baselines, data splits, statistical significance testing, or how structures are tokenized and generated; without these, the performance claim cannot be evaluated and is load-bearing for the paper's main result.

    Authors: We agree the abstract is high-level by design. Full details on the 80 tasks (definitions, baselines, splits, and significance testing) appear in Section 4 and the supplementary material; structure tokenization/generation is covered in Section 3. To address evaluability concerns, we will revise the abstract to briefly reference the evaluation protocol and direct readers to the relevant sections for complete information. revision: partial

  2. Referee: [Abstract] Abstract (and the section describing the unified tokenization scheme): no equations, pseudocode, or algorithmic detail is given for discretizing 3D molecular/protein structures (coordinates or graphs) into the shared vocabulary alongside SMILES/SELFIES and text; this leaves open whether generation remains purely next-token prediction without information loss or implicit modality-specific logic, which directly underpins the 'natively integrates ... without external encoders, projection adapters, or modality-specific output heads' claim.

    Authors: We acknowledge the need for greater explicitness. While Section 3 describes the unified tokenization scheme that places all modalities (including 3D structures) into a shared discrete vocabulary for uniform next-token prediction, we will add equations and pseudocode for the 3D discretization process to the main text. This will explicitly confirm the absence of information loss or modality-specific logic. revision: yes

Circularity Check

0 steps flagged

No circularity; empirical claims rest on downstream evaluation

full rationale

The paper describes an empirical multimodal foundation model built on Qwen3 with a unified tokenization scheme for sequences, structures, and language. No equations, derivations, or parameter-fitting steps are presented that would reduce any claimed prediction or result to inputs by construction. Performance on 80 tasks is reported as external validation rather than a quantity defined by the pretraining objective. No self-citations, uniqueness theorems, or ansatzes are invoked in a load-bearing way within the provided text. The architecture claim is an engineering choice evaluated empirically, not a self-referential derivation.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The central claim rests on the unproven premise that a unified discrete tokenization of structures is lossless enough for generation and that next-token prediction alone suffices for cross-modal alignment; no independent evidence for these modeling choices is supplied in the abstract.

free parameters (1)
  • base model sizes
    1.7B and 4B parameter variants chosen from Qwen3; these are architectural decisions that affect capacity but are not derived from the biological data.
axioms (1)
  • domain assumption Next-token prediction on a shared discrete token space is sufficient to learn and generate across sequences, structures, and language without modality-specific components.
    Invoked in the description of the unified tokenization scheme and single-objective training.

pith-pipeline@v0.9.1-grok · 5879 in / 1489 out tokens · 26890 ms · 2026-06-26T11:44:39.629520+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

155 extracted references · 1 canonical work pages

  1. [1]

    Uniprot: the universal protein knowledgebase in 2023.Nucleic acids research, 51(D1):D523–D531, 2023

  2. [2]

    Open-AlphaSeq: Open protein–protein interaction affinity datasets, 2025

    A-Alpha Bio. Open-AlphaSeq: Open protein–protein interaction affinity datasets, 2025. URL https: //huggingface.co/datasets/aalphabio/open-alphaseq

  3. [3]

    Prot2text: Multi- modal protein’s function generation with gnns and transformers

    Hadi Abdine, Michail Chatzianastasis, Costas Bouyioukos, and Michalis Vazirgiannis. Prot2text: Multi- modal protein’s function generation with gnns and transformers. InProceedings of the AAAI conference on artificial intelligence, volume 38, pages 10757–10765, 2024

  4. [4]

    Accurate structure prediction of biomolecular interactions with alphafold 3.Nature, 630(8016):493–500, 2024

    Josh Abramson, Jonas Adler, Jack Dunger, Richard Evans, Tim Green, Alexander Pritzel, Olaf Ronneberger, Lindsay Willmore, Andrew J Ballard, Joshua Bambrick, et al. Accurate structure prediction of biomolecular interactions with alphafold 3.Nature, 630(8016):493–500, 2024

  5. [5]

    Gpt-4 technical report.arXiv preprint arXiv:2303.08774, 2023

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report.arXiv preprint arXiv:2303.08774, 2023

  6. [6]

    gpt-oss-120b & gpt-oss-20b model card.arXiv preprint arXiv:2508.10925, 2025

    Sandhini Agarwal, Lama Ahmad, Jason Ai, Sam Altman, Andy Applebaum, Edwin Arbus, Rahul K Arora, Yu Bai, Bowen Baker, Haiming Bao, et al. gpt-oss-120b & gpt-oss-20b model card.arXiv preprint arXiv:2508.10925, 2025

  7. [7]

    Protein generation with evolutionary diffusion: sequence is all you need

    Sarah Alamdari, Nitya Thakkar, Rianne van den Berg, Alex Lu, Nicolo Fusi, Ava Amini, and Kevin Yang. Protein generation with evolutionary diffusion: sequence is all you need. InNeurIPS 2023 Generative AI and Biology (GenBio) Workshop

  8. [8]

    Claude 3.5 Sonnet model card addendum, 2024

    Anthropic. Claude 3.5 Sonnet model card addendum, 2024. URL https://www-cdn.anthropic.com/ fed9cc193a14b84131812372d8d5857f8f304c52/Model Card Claude 3 Addendum.pdf

  9. [9]

    Claude Opus 4.6 system card

    Anthropic. Claude Opus 4.6 system card. Technical report, Anthropic, February 2026. URL https: //www.anthropic.com/claude-opus-4-6-system-card

  10. [10]

    The claude 3 model family: Opus, sonnet, haiku.Claude-3 Model Card, 1(1):4, 2024

    AI Anthropic. The claude 3 model family: Opus, sonnet, haiku.Claude-3 Model Card, 1(1):4, 2024

  11. [11]

    Viraj Bagal, Rishal Aggarwal, P . K. Vinod, and U. Deva Priyakumar. Molgpt: Molecular generation using a transformer-decoder model.J. Chem. Inf. Model., 62(9):2064–2076, 2022

  12. [12]

    Equivariant energy-guided SDE for inverse molecular design

    Fan Bao, Min Zhao, Zhongkai Hao, Peiyao Li, Chongxuan Li, and Jun Zhu. Equivariant energy-guided SDE for inverse molecular design. InICLR. OpenReview.net, 2023

  13. [13]

    The protein data bank.Nucleic acids research, 28(1):235–242, 2000

    Helen M Berman, John Westbrook, Zukang Feng, Gary Gilliland, Talapady N Bhat, Helge Weissig, Ilya N Shindyalov, and Philip E Bourne. The protein data bank.Nucleic acids research, 28(1):235–242, 2000. 53 BioMatrix

  14. [14]

    Alphafold protein structure database 2025: a redesigned interface and updated structural coverage.Nucleic Acids Research, 54 (D1):D358–D362, 2026

    Damian Bertoni, Maxim Tsenkov, Paulyna Magana, Sreenath Nair, Ivanna Pidruchna, Marcelo Querino Lima Afonso, Adam Midlik, Urmila Paramval, Dare Lawal, Ahsan Tanweer, et al. Alphafold protein structure database 2025: a redesigned interface and updated structural coverage.Nucleic Acids Research, 54 (D1):D358–D362, 2026

  15. [15]

    Bronstein, and Alexander Tong

    Avishek Joey Bose, Tara Akhound-Sadegh, Guillaume Huguet, Kilian Fatras, Jarrid Rector-Brooks, Cheng- Hao Liu, Andrei Cristian Nica, Maksym Korablyov, Michael M. Bronstein, and Alexander Tong. Se(3)- stochastic flow matching for protein backbone generation. InICLR. OpenReview.net, 2024

  16. [16]

    Nathan Brown, Marco Fiscato, Marwin H. S. Segler, and Alain C. Vaucher. Guacamol: Benchmarking models for de novo molecular design.J. Chem. Inf. Model., 59(3):1096–1108, 2019

  17. [17]

    Learning to design protein- protein interactions with enhanced generalization

    Anton Bushuiev, Roman Bushuiev, Petr Kouba, Anatolii Filkin, Marketa Gabrielova, Michal Gabriel, Jir´ı Sedl´ar, Tom´as Pluskal, Jir´ı Damborsk´y, Stanislav Mazurenko, and Josef Sivic. Learning to design protein- protein interactions with enhanced generalization. InICLR. OpenReview.net, 2024

  18. [18]

    Jaakkola

    Andrew Campbell, Jason Yim, Regina Barzilay, Tom Rainforth, and Tommi S. Jaakkola. Generative flows on discrete state-spaces: Enabling multimodal flows with applications to protein co-design. InICML, Proceedings of Machine Learning Research, pages 5453–5512. PMLR / OpenReview.net, 2024

  19. [19]

    PRESTO: progressive pretraining enhances synthetic chemistry outcomes

    He Cao, Yanjun Shao, Zhiyuan Liu, Zijing Liu, Xiangru Tang, Yuan Yao, and Yu Li. PRESTO: progressive pretraining enhances synthetic chemistry outcomes. InEMNLP (Findings), Findings of ACL, pages 10197– 10224. Association for Computational Linguistics, 2024

  20. [20]

    Lifan Chen, Xiaoqin Tan, Dingyan Wang, Feisheng Zhong, Xiaohong Liu, Tianbiao Yang, Xiaomin Luo, Kaixian Chen, Hualiang Jiang, Mingyue Zheng, and Arne Elofsson. Transformercpi: improving compound- protein interaction prediction by sequence-based deep learning with self-attention mechanism and label reversal experiments.Bioinform., 36(16):4406–4414, 2020

  21. [21]

    Janus-pro: Unified multimodal understanding and generation with data and model scaling.arXiv preprint arXiv:2501.17811, 2025

    Xiaokang Chen, Zhiyu Wu, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, and Chong Ruan. Janus-pro: Unified multimodal understanding and generation with data and model scaling.arXiv preprint arXiv:2501.17811, 2025

  22. [22]

    Toward de novo protein design from natural language.BioRxiv, pages 2024–08, 2024

    Fengyuan Dai, Shiyang You, Yudian Zhu, Yuan Gao, Lihao Fu, Xibin Zhou, Jin Su, Chentong Wang, Yuliang Fan, Xiaoxiao Ma, et al. Toward de novo protein design from natural language.BioRxiv, pages 2024–08, 2024

  23. [23]

    Robust deep learning–based protein sequence design using proteinmpnn.Science, 378(6615):49–56, 2022

    Justas Dauparas, Ivan Anishchenko, Nathaniel Bennett, Hua Bai, Robert J Ragotte, Lukas F Milles, Basile IM Wicky, Alexis Courbet, Rob J de Haas, Neville Bethel, et al. Robust deep learning–based protein sequence design using proteinmpnn.Science, 378(6615):49–56, 2022

  24. [24]

    Translation between molecules and natural language

    Carl Edwards, Tuan Manh Lai, Kevin Ros, Garrett Honke, Kyunghyun Cho, and Heng Ji. Translation between molecules and natural language. InEMNLP, pages 375–413. Association for Computational Linguistics, 2022

  25. [25]

    Prottrans: toward understanding the language of life through self-supervised learning.IEEE transactions on pattern analysis and machine intelligence, 44(10): 7112–7127, 2021

    Ahmed Elnaggar, Michael Heinzinger, Christian Dallago, Ghalia Rehawi, Yu Wang, Llion Jones, Tom Gibbs, Tamas Feher, Christoph Angerer, Martin Steinegger, et al. Prottrans: toward understanding the language of life through self-supervised learning.IEEE transactions on pattern analysis and machine intelligence, 44(10): 7112–7127, 2021

  26. [26]

    Interleaved tool-call reasoning for protein function understanding, 2026

    Chuanliu Fan, Zicheng Ma, Huanran Meng, Aijia Zhang, Wenjie Du, Jun Zhang, Yi Qin Gao, Ziqiang Cao, and Guohong Fu. Interleaved tool-call reasoning for protein function understanding, 2026. URL https://arxiv.org/abs/2601.03604

  27. [27]

    Megascience: Pushing the frontiers of post-training datasets for science reasoning.arXiv preprint arXiv:2507.16812, 2025

    Run-Ze Fan, Zengzhi Wang, and Pengfei Liu. Megascience: Pushing the frontiers of post-training datasets for science reasoning.arXiv preprint arXiv:2507.16812, 2025

  28. [28]

    Mol-instructions: A large-scale biomolecular instruction dataset for large language models

    Yin Fang, Xiaozhuan Liang, Ningyu Zhang, Kangwei Liu, Rui Huang, Zhuo Chen, Xiaohui Fan, and Huajun Chen. Mol-instructions: A large-scale biomolecular instruction dataset for large language models. InICLR. OpenReview.net, 2024

  29. [29]

    Domain-agnostic molecular generation with chemical feedback

    Yin Fang, Ningyu Zhang, Zhuo Chen, Lingbing Guo, Xiaohui Fan, and Huajun Chen. Domain-agnostic molecular generation with chemical feedback. InICLR. OpenReview.net, 2024. 54 BioMatrix

  30. [30]

    Prediction of membrane protein types based on the hydrophobic index of amino acids.Journal of protein chemistry, 19(4):269–275, 2000

    Zhi-Ping Feng and Chun-Ting Zhang. Prediction of membrane protein types based on the hydrophobic index of amino acids.Journal of protein chemistry, 19(4):269–275, 2000

  31. [31]

    Francoeur, Tomohide Masuda, Jocelyn Sunseri, Andrew Jia, Richard B

    Paul G. Francoeur, Tomohide Masuda, Jocelyn Sunseri, Andrew Jia, Richard B. Iovanisci, Ian Snyder, and David Ryan Koes. Three-dimensional convolutional neural networks and a cross-docked data set for structure-based drug design.J. Chem. Inf. Model., 60(9):4200–4215, 2020

  32. [32]

    Tokenizing 3d molecule structure with quantized spherical coordinates

    Kaiyuan Gao, Yusong Wang, Haoxiang Guan, Zun Wang, Qizhi Pei, John Hopcroft, Kun He, and Lijun Wu. Tokenizing 3d molecule structure with quantized spherical coordinates. InProceedings of the 32nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining V . 1, pages 291–301, 2026

  33. [33]

    Niklas W. A. Gebauer, Michael Gastegger, and Kristof Sch¨utt. Symmetry-adapted generation of 3d point sets for the targeted discovery of molecules. InNeurIPS, pages 7564–7576, 2019

  34. [34]

    Binding affinity training data set, 2021

    J Glaser. Binding affinity training data set, 2021. URLhttps://huggingface.co/datasets/jglaser/binding affinity

  35. [35]

    Gemini 2.5: Our most intelligent AI model

    Google DeepMind. Gemini 2.5: Our most intelligent AI model. Google DeepMind Blog, March 2025. URL https://blog.google/technology/google-deepmind/gemini-model-thinking-updates-march-2025/ . Ac- cessed: 2025-08-12

  36. [36]

    The llama 3 herd of models.arXiv preprint arXiv:2407.21783, 2024

    Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al- Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models.arXiv preprint arXiv:2407.21783, 2024

  37. [37]

    3d equivariant diffusion for target-aware molecule generation and affinity prediction

    Jiaqi Guan, Wesley Wei Qian, Xingang Peng, Yufeng Su, Jian Peng, and Jianzhu Ma. 3d equivariant diffusion for target-aware molecule generation and affinity prediction. InICLR. OpenReview.net, 2023

  38. [38]

    Objective-reinforced generative adversarial networks (organ) for sequence generation models.arXiv preprint arXiv:1705.10843, 2017

    Gabriel Lima Guimaraes, Benjamin Sanchez-Lengeling, Carlos Outeiral, Pedro Luis Cunha Farias, and Al´an Aspuru-Guzik. Objective-reinforced generative adversarial networks (organ) for sequence generation models.arXiv preprint arXiv:1705.10843, 2017

  39. [39]

    Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025

  40. [40]

    Using support vector machine combined with auto covariance to predict protein–protein interactions from protein sequences.Nucleic acids research, 36(9): 3025–3030, 2008

    Yanzhi Guo, Lezheng Yu, Zhining Wen, and Menglong Li. Using support vector machine combined with auto covariance to predict protein–protein interactions from protein sequences.Nucleic acids research, 36(9): 3025–3030, 2008

  41. [41]

    J ¨urgen Haas, Alessandro Barbato, Dario Behringer, Gabriel Studer, Steven Roth, Martino Bertoni, Khaled Mostaguir, Rafal Gumienny, and Torsten Schwede. Continuous automated model evaluation (cameo) complementing the critical assessment of structure prediction in casp12.Proteins: Structure, Function, and Bioinformatics, 86:387–398, 2018

  42. [42]

    Simulating 500 million years of evolution with a language model.Science, 387(6736):850–858, 2025

    Thomas Hayes, Roshan Rao, Halil Akin, Nicholas J Sofroniew, Deniz Oktay, Zeming Lin, Robert Verkuil, Vincent Q Tran, Jonathan Deaton, Marius Wiggert, et al. Simulating 500 million years of evolution with a language model.Science, 387(6736):850–858, 2025

  43. [43]

    Measuring massive multitask language understanding

    Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. InICLR. OpenReview.net, 2021

  44. [44]

    Equivariant diffusion for molecule generation in 3d

    Emiel Hoogeboom, Victor Garcia Satorras, Cl´ement Vignac, and Max Welling. Equivariant diffusion for molecule generation in 3d. InICML, Proceedings of Machine Learning Research, pages 8867–8887. PMLR, 2022

  45. [45]

    OGB-LSC: A large-scale challenge for machine learning on graphs

    Weihua Hu, Matthias Fey, Hongyu Ren, Maho Nakata, Yuxiao Dong, and Jure Leskovec. OGB-LSC: A large-scale challenge for machine learning on graphs. InNeurIPS Datasets and Benchmarks, 2021

  46. [46]

    Conditional diffusion based on discrete graph structures for molecular graph generation

    Han Huang, Leilei Sun, Bowen Du, and Weifeng Lv. Conditional diffusion based on discrete graph structures for molecular graph generation. InAAAI, pages 4302–4311. AAAI Press, 2023

  47. [47]

    Learning joint 2-d and 3-d graph diffusion models for complete molecule generation.IEEE Trans

    Han Huang, Leilei Sun, Bowen Du, and Weifeng Lv. Learning joint 2-d and 3-d graph diffusion models for complete molecule generation.IEEE Trans. Neural Networks Learn. Syst., 35(9):11857–11871, 2024. 55 BioMatrix

  48. [48]

    MDM: molecular diffusion model for 3d molecule generation

    Lei Huang, Hengtong Zhang, Tingyang Xu, and Ka-Chun Wong. MDM: molecular diffusion model for 3d molecule generation. InAAAI, pages 5105–5112. AAAI Press, 2023

  49. [49]

    Qwen2.5-coder technical report.CoRR, abs/2409.12186, 2024

    Binyuan Hui, Jian Yang, Zeyu Cui, Jiaxi Yang, Dayiheng Liu, Lei Zhang, Tianyu Liu, Jiajun Zhang, Bowen Yu, Kai Dang, An Yang, Rui Men, Fei Huang, Xingzhang Ren, Xuancheng Ren, Jingren Zhou, and Junyang Lin. Qwen2.5-coder technical report.CoRR, abs/2409.12186, 2024

  50. [50]

    Gpt-4o system card.arXiv preprint arXiv:2410.21276, 2024

    Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card.arXiv preprint arXiv:2410.21276, 2024

  51. [51]

    Illuminating protein space with a programmable generative model.Nature, 623(7989):1070–1078, 2023

    John B Ingraham, Max Baranov, Zak Costello, Karl W Barber, Wujie Wang, Ahmed Ismail, Vincent Frappier, Dana M Lord, Christopher Ng-Thow-Hing, Erik R Van Vlack, et al. Illuminating protein space with a programmable generative model.Nature, 623(7989):1070–1078, 2023

  52. [52]

    Dejun Jiang, Chang-Yu Hsieh, Zhenxing Wu, Yu Kang, Jike Wang, Ercheng Wang, Ben Liao, Chao Shen, Lei Xu, Jian Wu, et al. Interactiongraphnet: A novel and efficient deep graph representation learning framework for accurate protein–ligand interaction predictions.Journal of medicinal chemistry, 64(24):18209–18232, 2021

  53. [53]

    Jaakkola

    Wengong Jin, Regina Barzilay, and Tommi S. Jaakkola. Junction tree variational autoencoder for molecular graph generation. InICML, Proceedings of Machine Learning Research, pages 2328–2337. PMLR, 2018

  54. [54]

    Pubchem 2025 update.Nucleic acids research, 53(D1):D1516–D1525, 2025

    Sunghwan Kim, Jie Chen, Tiejun Cheng, Asta Gindulyte, Jia He, Siqian He, Qingliang Li, Benjamin A Shoemaker, Paul A Thiessen, Bo Yu, et al. Pubchem 2025 update.Nucleic acids research, 53(D1):D1516–D1525, 2025

  55. [55]

    Kingma and Max Welling

    Diederik P . Kingma and Max Welling. Auto-encoding variational bayes. InICLR, 2014

  56. [56]

    Self- referencing embedded strings (SELFIES): A 100% robust molecular string representation.Mach

    Mario Krenn, Florian H ¨ase, AkshatKumar Nigam, Pascal Friederich, and Al ´an Aspuru-Guzik. Self- referencing embedded strings (SELFIES): A 100% robust molecular string representation.Mach. Learn. Sci. Technol., 1(4):45024, 2020

  57. [57]

    Efficient memory management for large language model serving with pagedattention

    Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonza- lez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. InSOSP, pages 611–626. ACM, 2023

  58. [58]

    Compressed graph representation for scalable molecular graph generation.J

    Youngchun Kwon, Dongseon Lee, Youn-Suk Choi, Kyoham Shin, and Seokho Kang. Compressed graph representation for scalable molecular graph generation.J. Cheminformatics, 12(1):58, 2020

  59. [59]

    Sch ¨utt

    Tuan Le, Julian Cremer, Frank No ´e, Djork-Arn ´e Clevert, and Kristof T. Sch ¨utt. Navigating the design space of equivariant diffusion-based generative models for de novo 3d molecule generation. InICLR. OpenReview.net, 2024

  60. [60]

    Speak-to-structure: Evaluating llms in open-domain natural language-driven molecule generation

    Jiatong Li, Junxian Li, Weida Wang, Yunqing Liu, Changmeng Zheng, Dongzhan Zhou, Xiao-yong Wei, and Qing Li. Speak-to-structure: Evaluating llms in open-domain natural language-driven molecule generation. arXiv preprint arXiv:2412.14642, 2024

  61. [61]

    Speaking the language of science: Toward a general-purpose generative foundation model for the natural sciences.arXiv preprint arXiv:2606.16905, 2026

    Mingyang Li, Yurou Liu, Jieping Ye, Bing Su, Ji-Rong Wen, and Zheng Wang. Speaking the language of science: Toward a general-purpose generative foundation model for the natural sciences.arXiv preprint arXiv:2606.16905, 2026

  62. [62]

    Monn: a multi-objective neural network for predicting compound-protein interactions and affinities.Cell systems, 10(4):308–322, 2020

    Shuya Li, Fangping Wan, Hantao Shu, Tao Jiang, Dan Zhao, and Jianyang Zeng. Monn: a multi-objective neural network for predicting compound-protein interactions and affinities.Cell systems, 10(4):308–322, 2020

  63. [63]

    Towards 3d molecule-text interpretation in language models

    Sihang Li, Zhiyuan Liu, Yanchen Luo, Xiang Wang, Xiangnan He, Kenji Kawaguchi, Tat-Seng Chua, and Qi Tian. Towards 3d molecule-text interpretation in language models. InICLR. OpenReview.net, 2024

  64. [64]

    Sc2mol: a scaffold-based two-step molecule generator with variational autoencoder and transformer.Bioinform., 39(1), 2023

    Zhirui Liao, Lei Xie, Hiroshi Mamitsuka, and Shanfeng Zhu. Sc2mol: a scaffold-based two-step molecule generator with variational autoencoder and transformer.Bioinform., 39(1), 2023

  65. [65]

    Evolutionary-scale prediction of atomic-level protein structure with a language model.Science, 379(6637):1123–1130, 2023

    Zeming Lin, Halil Akin, Roshan Rao, Brian Hie, Zhongkai Zhu, Wenting Lu, Nikita Smetanin, Robert Verkuil, Ori Kabeli, Yaniv Shmueli, et al. Evolutionary-scale prediction of atomic-level protein structure with a language model.Science, 379(6637):1123–1130, 2023. 56 BioMatrix

  66. [66]

    Protein design with dynamic protein vocabulary

    Nuowei Liu, Jiahao Kuang, Yanting Liu, Tao Ji, Changzhi Sun, Man Lan, and Yuanbin Wu. Protein design with dynamic protein vocabulary. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems

  67. [67]

    A text-guided protein design framework.arXiv preprint arXiv:2302.04611, 2023

    Shengchao Liu, Yutao Zhu, Jiarui Lu, Zhao Xu, Weili Nie, Anthony Gitter, Chaowei Xiao, Jian Tang, Hongyu Guo, and Anima Anandkumar. A text-guided protein design framework.arXiv preprint arXiv:2302.04611, 2023

  68. [68]

    Jorissen, and Michael K

    Tiqing Liu, Yuhmei Lin, Xin Wen, Robert N. Jorissen, and Michael K. Gilson. Bindingdb: a web-accessible database of experimentally determined protein-ligand binding affinities.Nucleic Acids Res., 35(Database- Issue):198–201, 2007

  69. [69]

    Bindingdb in 2024: a fair knowledgebase of protein-small molecule binding data

    Tiqing Liu, Linda Hwang, Stephen K Burley, Carmen I Nitsche, Christopher Southan, W Patrick Walters, and Michael K Gilson. Bindingdb in 2024: a fair knowledgebase of protein-small molecule binding data. Nucleic acids research, 53(D1):D1633–D1644, 2025

  70. [70]

    Forging the basis for developing protein–ligand interaction scoring functions.Accounts of chemical research, 50(2):302–309, 2017

    Zhihai Liu, Minyi Su, Li Han, Jie Liu, Qifan Yang, Yan Li, and Renxiao Wang. Forging the basis for developing protein–ligand interaction scoring functions.Accounts of chemical research, 50(2):302–309, 2017

  71. [71]

    Molca: Molecular graph-language modeling with cross-modal projector and uni-modal adapter

    Zhiyuan Liu, Sihang Li, Yanchen Luo, Hao Fei, Yixin Cao, Kenji Kawaguchi, Xiang Wang, and Tat-Seng Chua. Molca: Molecular graph-language modeling with cross-modal projector and uni-modal adapter. In EMNLP, pages 15623–15638. Association for Computational Linguistics, 2023

  72. [72]

    Prott3: Protein-to-text generation for text-based protein understanding

    Zhiyuan Liu, An Zhang, Hao Fei, Enzhi Zhang, Xiang Wang, Kenji Kawaguchi, and Tat-Seng Chua. Prott3: Protein-to-text generation for text-based protein understanding. InACL (1), pages 5949–5966. Association for Computational Linguistics, 2024

  73. [73]

    Next-mol: 3d diffusion meets 1d language modeling for 3d molecule generation

    Zhiyuan Liu, Yanchen Luo, Han Huang, Enzhi Zhang, Sihang Li, Junfeng Fang, Yaorui Shi, Xiang Wang, Kenji Kawaguchi, and Tat-Seng Chua. Next-mol: 3d diffusion meets 1d language modeling for 3d molecule generation. InICLR. OpenReview.net, 2025

  74. [74]

    Kyle Lo, Lucy Lu Wang, Mark Neumann, Rodney Kinney, and Daniel S. Weld. S2ORC: the semantic scholar open research corpus. InACL, pages 4969–4983. Association for Computational Linguistics, 2020

  75. [75]

    Chemical reactions from us patents, 2017

    Daniel Lowe. Chemical reactions from us patents, 2017

  76. [76]

    Fineweb-edu: the finest collection of educational content, 2024

    Anton Lozhkov, Loubna Ben Allal, Leandro von Werra, and Thomas Wolf. Fineweb-edu: the finest collection of educational content, 2024. URLhttps://huggingface.co/datasets/HuggingFaceFW/fineweb-edu

  77. [77]

    Tankbind: Trigonometry- aware neural networks for drug-protein binding structure prediction

    Wei Lu, Qifeng Wu, Jixian Zhang, Jiahua Rao, Chengtao Li, and Shuangjia Zheng. Tankbind: Trigonometry- aware neural networks for drug-protein binding structure prediction. InNeurIPS, 2022

  78. [78]

    Moleculeqa: A dataset to evaluate factual accuracy in molecular comprehension

    Xingyu Lu, He Cao, Zijing Liu, Shengyuan Bai, Leqing Chen, Yuan Yao, Hai-Tao Zheng, and Yu Li. Moleculeqa: A dataset to evaluate factual accuracy in molecular comprehension. InEMNLP (Findings), Findings of ACL, pages 3769–3789. Association for Computational Linguistics, 2024

  79. [79]

    Biomedgpt: Open multimodal generative pre-trained transformer for biomedicine.arXiv preprint arXiv:2308.09442, 2023

    Yizhen Luo, Jiahuan Zhang, Siqi Fan, Kai Yang, Yushuai Wu, Mu Qiao, and Zaiqing Nie. Biomedgpt: Open multimodal generative pre-trained transformer for biomedicine.arXiv preprint arXiv:2308.09442, 2023

  80. [80]

    An autoregressive flow model for 3d molecular geometry generation from scratch

    Youzhi Luo and Shuiwang Ji. An autoregressive flow model for 3d molecular geometry generation from scratch. InICLR. OpenReview.net, 2022

Showing first 80 references.