pith. sign in

arxiv: 2607.01982 · v1 · pith:4AJNGBNMnew · submitted 2026-07-02 · 💻 cs.CV · cs.AI· q-bio.BM

MolSight: A Graph-Aware Vision-Language Model for Unified Chemical Image Understanding

Pith reviewed 2026-07-03 15:38 UTC · model grok-4.3

classification 💻 cs.CV cs.AIq-bio.BM
keywords vision-language modelsmolecular imageschemical structure understandingmolecular topologygraph-aware modelsdrug discoveryimage reasoningmolecular design
0
0 comments X

The pith

MolSight improves vision-language models for molecular images by injecting bond adjacency and aligning visuals to chemical symbols.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents MolSight to address how standard vision-language models miss structural details and topology when reading molecular drawings. It adds a Molecular Topology Module that feeds chemical-bond connections directly into the vision tokens and a Molecular Grounding Module that ties those tokens to the language of chemical formulas. Experiments show the combined system beats prior vision-language models, molecular language models, and domain-specific tools on several chemical image tasks. If the approach holds, systems could extract reliable structural and functional information straight from pictures instead of needing separate graph inputs.

Core claim

MolSight is a graph-aware vision-language model framework that integrates a Molecular Topology Module to inject chemical-bond adjacency information into vision tokens and a Molecular Grounding Module to align visual features with chemical symbolic semantics, resulting in stronger performance across molecular visual understanding tasks.

What carries the argument

Molecular Topology Module and Molecular Grounding Module, which embed bond adjacency into vision tokens and link those tokens to symbolic chemical semantics.

If this is right

  • Molecular images can be processed directly for structure and function without separate graph conversion steps.
  • Unified vision-language systems become viable for molecular design and drug discovery workflows.
  • Performance gains appear across multiple chemical visual tasks compared with prior models and tools.
  • A new baseline level of molecular image reasoning is established for subsequent work.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same module pattern could be tested on other image domains that contain implicit graphs, such as circuit diagrams or protein interaction maps.
  • Performance on noisy or hand-drawn molecular sketches from laboratory notebooks remains an open question for real deployment.
  • Combining the model with larger language models might enable end-to-end pipelines that both interpret an image and propose synthetic routes.

Load-bearing premise

That adding bond adjacency to vision tokens and linking visuals to chemical symbols will overcome the structural alignment and topological weaknesses of existing vision-language models.

What would settle it

A controlled test in which a standard vision-language model without the two modules matches or exceeds MolSight accuracy on the same set of molecular image benchmarks.

Figures

Figures reproduced from arXiv: 2607.01982 by Wenda Wang, Yihan Tong, Yuwei Hu, Zhewei Wei.

Figure 1
Figure 1. Figure 1: Overview of MolSight. a) Overall framework. Molecular image tokens and SVG text tokens are updated and integrated through molecular topology and grounding modules before LLM decoding. b) The Molecular Topology Module injects molecular topology into vision tokens, while the Molecular Grounding Module aligns vision tokens with SVG text annotations, jointly introducing graph topology messages into vision-toke… view at source ↗
Figure 3
Figure 3. Figure 3: Representative molecular captioning comparison [PITH_FULL_IMAGE:figures/full_fig_p013_3.png] view at source ↗
Figure 2
Figure 2. Figure 2: Representative molecular captioning comparison [PITH_FULL_IMAGE:figures/full_fig_p013_2.png] view at source ↗
Figure 5
Figure 5. Figure 5: Representative examples for the SMILES transla [PITH_FULL_IMAGE:figures/full_fig_p014_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Statistical analysis of MolSight’s molecular cap [PITH_FULL_IMAGE:figures/full_fig_p015_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Absolute error distributions for physicochemical descriptor prediction. Each subplot shows the distribution of absolute [PITH_FULL_IMAGE:figures/full_fig_p016_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Prediction V.S. ground-truth scatter plots for physicochemical descriptor prediction. Each subplot reports the Pearson [PITH_FULL_IMAGE:figures/full_fig_p019_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Representative cases of MolSight on SMILES translation tasks. [PITH_FULL_IMAGE:figures/full_fig_p020_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Representative cases of MolSight on Molecular Captioning tasks. The green values indicate predictions strongly [PITH_FULL_IMAGE:figures/full_fig_p021_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Representative cases of MolSight on descriptor estimation tasks. The green values indicate correct predictions, while [PITH_FULL_IMAGE:figures/full_fig_p022_11.png] view at source ↗
read the original abstract

Using molecular large language models (LLMs) as a unified framework for understanding molecular structures and functions is emerging as a new trend in tasks such as molecular design and drug discovery. However, these models struggle to fully capture the visual representation of molecular structures, limiting their potential. While existing molecular vision-language models (VLMs) show promise, they still face challenges in structural alignment and lack the necessary topological modeling for accurate molecular understanding. To address this, we propose MolSight, a graph-aware vision-language model framework designed to enhance the understanding of molecular images by VLMs. MolSight integrates a Molecular Topology Module to inject chemical-bond adjacency information into vision tokens, and a Molecular Grounding Module to align visual features with chemical symbolic semantics. Our experiments demonstrate that MolSight significantly outperforms existing VLMs, molecular LLMs, and specialized tools across multiple chemical visual understanding tasks, achieving a new level of molecular image reasoning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper proposes MolSight, a graph-aware vision-language model for molecular image understanding. It integrates a Molecular Topology Module to inject chemical-bond adjacency information into vision tokens and a Molecular Grounding Module to align visual features with chemical symbolic semantics. Experiments are claimed to show that MolSight significantly outperforms existing VLMs, molecular LLMs, and specialized tools across multiple chemical visual understanding tasks.

Significance. If validated, the approach of explicitly injecting graph topology and grounding visual features to symbolic chemistry could address known limitations in VLMs for structural molecular reasoning, with potential impact on drug discovery and molecular design pipelines.

major comments (1)
  1. Abstract: The central claim of significant outperformance is presented without any experimental details, datasets, metrics, baselines, error bars, or result tables, making it impossible to determine whether the performance claims are supported by the paper's own data or methods.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their feedback. The single major comment concerns the level of detail in the abstract; we address it directly below and note that the manuscript body contains the requested experimental information.

read point-by-point responses
  1. Referee: Abstract: The central claim of significant outperformance is presented without any experimental details, datasets, metrics, baselines, error bars, or result tables, making it impossible to determine whether the performance claims are supported by the paper's own data or methods.

    Authors: We agree that the abstract is written at a high level and omits specific experimental details. The full paper (Sections 4 and 5) specifies the datasets (e.g., PubChem, ChEMBL-derived molecular image sets), metrics (accuracy, F1, exact match on structure recognition and property prediction), baselines (standard VLMs, molecular LLMs, and specialized tools), and reports results with tables and error bars. To improve clarity we will revise the abstract to briefly name the primary evaluation tasks, key datasets, and the magnitude of gains while remaining within length limits. revision: partial

Circularity Check

0 steps flagged

No significant circularity

full rationale

The provided abstract and description introduce MolSight via two new modules (Molecular Topology Module and Molecular Grounding Module) whose purpose is stated descriptively, with performance claims resting on experiments. No equations, parameter-fitting steps, self-citations used as load-bearing uniqueness theorems, or renamings of prior results appear in the text. The derivation chain therefore contains no self-definitional reductions, fitted-input predictions, or imported ansatzes that collapse to the inputs by construction. The central claim remains an empirical assertion about outperformance rather than a closed logical loop.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies no information on free parameters, background axioms, or new postulated entities.

pith-pipeline@v0.9.1-grok · 5696 in / 1117 out tokens · 27441 ms · 2026-07-03T15:38:45.860085+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

75 extracted references · 13 canonical work pages · 8 internal anchors

  1. [1]

    Cell , volume=

    A deep learning approach to antibiotic discovery , author=. Cell , volume=. 2020 , publisher=

  2. [2]

    Nature , volume=

    Scaling deep learning for materials discovery , author=. Nature , volume=. 2023 , publisher=

  3. [3]

    SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules , author=. Journal of chemical information and computer sciences , volume=. 1988 , publisher=

  4. [5]

    Nature , volume=

    Autonomous chemical research with large language models , author=. Nature , volume=. 2023 , publisher=

  5. [6]

    Digital Discovery , volume=

    Comparing software tools for optical chemical structure recognition , author=. Digital Discovery , volume=. 2024 , publisher=

  6. [7]

    Nature Communications , volume=

    PatCID: an open-access dataset of chemical structures in patent documents , author=. Nature Communications , volume=. 2024 , publisher=

  7. [9]

    International conference on machine learning , pages=

    Learning transferable visual models from natural language supervision , author=. International conference on machine learning , pages=. 2021 , organization=

  8. [11]

    Proceedings of the IEEE/CVF international conference on computer vision , pages=

    Emerging properties in self-supervised vision transformers , author=. Proceedings of the IEEE/CVF international conference on computer vision , pages=

  9. [12]

    International conference on machine learning , pages=

    Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models , author=. International conference on machine learning , pages=. 2023 , organization=

  10. [13]

    Advances in neural information processing systems , volume=

    Visual instruction tuning , author=. Advances in neural information processing systems , volume=

  11. [15]

    International conference on machine learning , pages=

    Neural message passing for quantum chemistry , author=. International conference on machine learning , pages=. 2017 , organization=

  12. [17]

    Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing , pages=

    Translation between molecules and natural language , author=. Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing , pages=

  13. [18]

    Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing , pages=

    Biot5: Enriching cross-modal integration in biology with chemical knowledge and natural language associations , author=. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing , pages=

  14. [20]

    Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing , pages=

    Molca: Molecular graph-language modeling with cross-modal projector and uni-modal adapter , author=. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing , pages=

  15. [22]

    The Thirteenth International Conference on Learning Representations , year=

    Atomas: Hierarchical adaptive alignment on molecule-text for unified molecule understanding and generation , author=. The Thirteenth International Conference on Learning Representations , year=

  16. [23]

    Computers in biology and medicine , volume=

    Git-mol: A multi-modal large language model for molecular science with graph, image, and text , author=. Computers in biology and medicine , volume=. 2024 , publisher=

  17. [24]

    Proceedings of the AAAI Conference on Artificial Intelligence , volume=

    Chemvlm: Exploring the power of multimodal large language models in chemistry area , author=. Proceedings of the AAAI Conference on Artificial Intelligence , volume=

  18. [27]

    Advances in neural information processing systems , volume=

    Flamingo: a visual language model for few-shot learning , author=. Advances in neural information processing systems , volume=

  19. [29]

    Proceedings of the IEEE conference on computer vision and pattern recognition , pages=

    Making the v in vqa matter: Elevating the role of image understanding in visual question answering , author=. Proceedings of the IEEE conference on computer vision and pattern recognition , pages=

  20. [30]

    European conference on computer vision , pages=

    Microsoft coco: Common objects in context , author=. European conference on computer vision , pages=. 2014 , organization=

  21. [32]

    2024 International Joint Conference on Neural Networks (IJCNN) , pages=

    Zero-shot visual reasoning by vision-language models: Benchmarking and analysis , author=. 2024 International Joint Conference on Neural Networks (IJCNN) , pages=. 2024 , organization=

  22. [34]

    InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

    Internvl3. 5: Advancing open-source multimodal models in versatility, reasoning, and efficiency , author=. arXiv preprint arXiv:2508.18265 , year=

  23. [35]

    Nucleic acids research , volume=

    PubChem 2023 update , author=. Nucleic acids research , volume=. 2023 , publisher=

  24. [36]

    Nature Machine Intelligence , volume=

    Multi-modal molecule structure--text model for text-based retrieval and editing , author=. Nature Machine Intelligence , volume=. 2023 , publisher=

  25. [37]

    Journal of chemical information and modeling , volume=

    MolScribe: robust molecular structure recognition with image-to-graph generation , author=. Journal of chemical information and modeling , volume=. 2023 , publisher=

  26. [38]

    Journal of Cheminformatics , volume=

    DECIMER: towards deep learning for chemical image recognition , author=. Journal of Cheminformatics , volume=. 2020 , publisher=

  27. [39]

    Findings of the Association for Computational Linguistics: EMNLP 2024 , pages=

    Moleculeqa: A dataset to evaluate factual accuracy in molecular comprehension , author=. Findings of the Association for Computational Linguistics: EMNLP 2024 , pages=

  28. [40]

    Proceedings of the AAAI Conference on Artificial Intelligence , volume=

    RTMol: Rethinking Molecule-text Alignment in a Round-trip View , author=. Proceedings of the AAAI Conference on Artificial Intelligence , volume=

  29. [41]

    Chemical science , volume=

    MoleculeNet: a benchmark for molecular machine learning , author=. Chemical science , volume=. 2018 , publisher=

  30. [42]

    Advances in Neural Information Processing Systems , volume=

    Molvision: Molecular property prediction with vision language models , author=. Advances in Neural Information Processing Systems , volume=

  31. [43]

    Nature Machine Intelligence , volume=

    A quantitative analysis of knowledge-learning preferences in large language models in molecular science , author=. Nature Machine Intelligence , volume=. 2025 , publisher=

  32. [44]

    International Conference on Learning Representations , volume=

    Mol-instructions: A large-scale biomolecular instruction dataset for large language models , author=. International Conference on Learning Representations , volume=

  33. [45]

    GPT-4 Technical Report

    Achiam, J.; Adler, S.; Agarwal, S.; Ahmad, L.; Akkaya, I.; Aleman, F. L.; Almeida, D.; Altenschmidt, J.; Altman, S.; Anadkat, S.; et al. 2023. Gpt-4 technical report. arXiv preprint arXiv:2303.08774

  34. [46]

    Adak, D.; Rawat, Y.; and Vyas, S. 2026. Molvision: Molecular property prediction with vision language models. Advances in Neural Information Processing Systems 38

  35. [47]

    Alayrac, J.-B.; Donahue, J.; Luc, P.; Miech, A.; Barr, I.; Hasson, Y.; Lenc, K.; Mensch, A.; Millican, K.; Reynolds, M.; et al. 2022. Flamingo: a visual language model for few-shot learning. Advances in neural information processing systems 35:23716--23736

  36. [48]

    Bai, J.; Bai, S.; Chu, Y.; Cui, Z.; Dang, K.; Deng, X.; Fan, Y.; Ge, W.; Han, Y.; Huang, F.; et al. 2023. Qwen technical report. arXiv preprint arXiv:2309.16609

  37. [49]

    Relational inductive biases, deep learning, and graph networks

    Battaglia, P. W.; Hamrick, J. B.; Bapst, V.; Sanchez-Gonzalez, A.; Zambaldi, V.; Malinowski, M.; Tacchetti, A.; Raposo, D.; Santoro, A.; Faulkner, R.; et al. 2018. Relational inductive biases, deep learning, and graph networks. arXiv preprint arXiv:1806.01261

  38. [50]

    A.; MacKnight, R.; Kline, B.; and Gomes, G

    Boiko, D. A.; MacKnight, R.; Kline, B.; and Gomes, G. 2023. Autonomous chemical research with large language models. Nature 624(7992):570--578

  39. [51]

    ChemCrow: Augmenting large-language models with chemistry tools

    Bran, A. M.; Cox, S.; Schilter, O.; Baldassari, C.; White, A. D.; and Schwaller, P. 2023. Chemcrow: Augmenting large-language models with chemistry tools. arXiv preprint arXiv:2304.05376

  40. [52]

    Caron, M.; Touvron, H.; Misra, I.; J \'e gou, H.; Mairal, J.; Bojanowski, P.; and Joulin, A. 2021. Emerging properties in self-supervised vision transformers. In Proceedings of the IEEE/CVF international conference on computer vision , 9650--9660

  41. [53]

    Chen, Y.; Yao, Q.; Zhang, J.; Cheng, J.; and Bian, Y. 2024. Hight: Hierarchical graph tokenization for molecule-language alignment. arXiv preprint arXiv:2406.14021

  42. [54]

    Chen, L.; Shi, R.; Yu, G.; and Yang, Y. 2026. Rtmol: Rethinking molecule-text alignment in a round-trip view. In Proceedings of the AAAI Conference on Artificial Intelligence , volume 40, 75--82

  43. [55]

    Cui, C.; Zhou, Y.; Yang, X.; Wu, S.; Zhang, L.; Zou, J.; and Yao, H. 2023. Holistic analysis of hallucination in gpt-4v (ision): Bias and interference challenges. arXiv preprint arXiv:2311.03287

  44. [56]

    Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. 2020. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929

  45. [57]

    Edwards, C.; Lai, T.; Ros, K.; Honke, G.; Cho, K.; and Ji, H. 2022. Translation between molecules and natural language. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing , 375--413

  46. [58]

    Fang, Y.; Liang, X.; Zhang, N.; Liu, K.; Huang, R.; Chen, Z.; Fan, X.; and Chen, H. 2024. Mol-instructions: A large-scale biomolecular instruction dataset for large language models. In International Conference on Learning Representations , volume 2024, 48221--48251

  47. [59]

    S.; Riley, P

    Gilmer, J.; Schoenholz, S. S.; Riley, P. F.; Vinyals, O.; and Dahl, G. E. 2017. Neural message passing for quantum chemistry. In International conference on machine learning , 1263--1272. Pmlr

  48. [60]

    Goyal, Y.; Khot, T.; Summers-Stay, D.; Batra, D.; and Parikh, D. 2017. Making the v in vqa matter: Elevating the role of image understanding in visual question answering. In Proceedings of the IEEE conference on computer vision and pattern recognition , 6904--6913

  49. [61]

    A.; Thiessen, P

    Kim, S.; Chen, J.; Cheng, T.; Gindulyte, A.; He, J.; He, S.; Li, Q.; Shoemaker, B. A.; Thiessen, P. A.; Yu, B.; et al. 2023. Pubchem 2023 update. Nucleic acids research 51(D1):D1373--D1380

  50. [62]

    Semi-Supervised Classification with Graph Convolutional Networks

    Kipf, T. N., and Welling, M. 2016. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907

  51. [63]

    J.; Boehme, T.; Boyer, S

    Krasnov, A.; Barnabas, S. J.; Boehme, T.; Boyer, S. K.; and Weber, L. 2024. Comparing software tools for optical chemical structure recognition. Digital Discovery 3(4):681--693

  52. [64]

    Li, J.; Li, D.; Savarese, S.; and Hoi, S. 2023. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In International conference on machine learning , 19730--19742. PMLR

  53. [65]

    Li, J.; Zhang, D.; Wang, X.; Hao, Z.; Lei, J.; Tan, Q.; Zhou, C.; Liu, W.; Yang, Y.; Xiong, X.; et al. 2025. Chemvlm: Exploring the power of multimodal large language models in chemistry area. In Proceedings of the AAAI Conference on Artificial Intelligence , volume 39, 415--423

  54. [66]

    Lin, T.-Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Doll \'a r, P.; and Zitnick, C. L. 2014. Microsoft coco: Common objects in context. In European conference on computer vision , 740--755. Springer

  55. [67]

    Liu, H.; Li, C.; Wu, Q.; and Lee, Y. J. 2023a. Visual instruction tuning. Advances in neural information processing systems 36:34892--34916

  56. [68]

    Liu, S.; Nie, W.; Wang, C.; Lu, J.; Qiao, Z.; Liu, L.; Tang, J.; Xiao, C.; and Anandkumar, A. 2023b. Multi-modal molecule structure--text model for text-based retrieval and editing. Nature Machine Intelligence 5(12):1447--1457

  57. [69]

    Liu, Z.; Li, S.; Luo, Y.; Fei, H.; Cao, Y.; Kawaguchi, K.; Wang, X.; and Chua, T.-S. 2023c. Molca: Molecular graph-language modeling with cross-modal projector and uni-modal adapter. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing , 15623--15638

  58. [70]

    Liu, P.; Ren, Y.; Tao, J.; and Ren, Z. 2024. Git-mol: A multi-modal large language model for molecular science with graph, image, and text. Computers in biology and medicine 171:108073

  59. [71]

    Liu, P.; Tao, J.; and Ren, Z. 2025. A quantitative analysis of knowledge-learning preferences in large language models in molecular science. Nature Machine Intelligence 7(2):315--327

  60. [72]

    Lu, X.; Cao, H.; Liu, Z.; Bai, S.; Chen, L.; Yao, Y.; Zheng, H.-T.; and Li, Y. 2024. Moleculeqa: A dataset to evaluate factual accuracy in molecular comprehension. In Findings of the Association for Computational Linguistics: EMNLP 2024 , 3769--3789

  61. [73]

    S.; Aykol, M.; Cheon, G.; and Cubuk, E

    Merchant, A.; Batzner, S.; Schoenholz, S. S.; Aykol, M.; Cheon, G.; and Cubuk, E. D. 2023. Scaling deep learning for materials discovery. Nature 624(7990):80--85

  62. [74]

    I.; Yu, F.; and Staar, P

    Morin, L.; Weber, V.; Meijer, G. I.; Yu, F.; and Staar, P. W. 2024. Patcid: an open-access dataset of chemical structures in patent documents. Nature Communications 15(1):6532

  63. [75]

    Nagar, A.; Jaiswal, S.; and Tan, C. 2024. Zero-shot visual reasoning by vision-language models: Benchmarking and analysis. In 2024 International Joint Conference on Neural Networks (IJCNN) , 1--8. IEEE

  64. [76]

    Pei, Q.; Zhang, W.; Zhu, J.; Wu, K.; Gao, K.; Wu, L.; Xia, Y.; and Yan, R. 2023. Biot5: Enriching cross-modal integration in biology with chemical knowledge and natural language associations. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing , 1102--1123

  65. [77]

    W.; and Barzilay, R

    Qian, Y.; Guo, J.; Tu, Z.; Li, Z.; Coley, C. W.; and Barzilay, R. 2023. Molscribe: robust molecular structure recognition with image-to-graph generation. Journal of chemical information and modeling 63(7):1925--1934

  66. [78]

    W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al

    Radford, A.; Kim, J. W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. 2021. Learning transferable visual models from natural language supervision. In International conference on machine learning , 8748--8763. PmLR

  67. [79]

    Rajan, K.; Zielesny, A.; and Steinbeck, C. 2020. Decimer: towards deep learning for chemical image recognition. Journal of Cheminformatics 12(1):65

  68. [80]

    M.; Yang, K.; Swanson, K.; Jin, W.; Cubillos-Ruiz, A.; Donghia, N

    Stokes, J. M.; Yang, K.; Swanson, K.; Jin, W.; Cubillos-Ruiz, A.; Donghia, N. M.; MacNair, C. R.; French, S.; Carfrae, L. A.; Bloom-Ackermann, Z.; et al. 2020. A deep learning approach to antibiotic discovery. Cell 180(4):688--702

  69. [81]

    Tan, Q.; Zhou, D.; Xia, P.; Liu, W.; Ouyang, W.; Bai, L.; Li, Y.; and Fu, T. 2025. Chemmllm: Chemical multimodal large language model. arXiv preprint arXiv:2505.16326

  70. [82]

    Veli c kovi \'c , P.; Cucurull, G.; Casanova, A.; Romero, A.; Lio, P.; and Bengio, Y. 2017. Graph attention networks. arXiv preprint arXiv:1710.10903

  71. [83]

    Weininger, D. 1988. Smiles, a chemical language and information system. 1. introduction to methodology and encoding rules. Journal of chemical information and computer sciences 28(1):31--36

  72. [84]

    N.; Gomes, J.; Geniesse, C.; Pappu, A

    Wu, Z.; Ramsundar, B.; Feinberg, E. N.; Gomes, J.; Geniesse, C.; Pappu, A. S.; Leswing, K.; and Pande, V. 2018. Moleculenet: a benchmark for molecular machine learning. Chemical science 9(2):513--530

  73. [85]

    Xu, W.; Wang, J.; Wang, W.; Chen, Z.; Zhou, W.; Yang, A.; Lu, L.; Li, H.; Wang, X.; Zhu, X.; et al. 2025. Visulogic: A benchmark for evaluating visual reasoning in multi-modal large language models. arXiv preprint arXiv:2504.15279

  74. [86]

    Zhang, D.; Liu, W.; Tan, Q.; Chen, J.; Yan, H.; Yan, Y.; Li, J.; Huang, W.; Yue, X.; Ouyang, W.; et al. 2024. Chemllm: A chemical large language model. arXiv preprint arXiv:2402.06852

  75. [87]

    Zhang, Y.; Ye, G.; Yuan, C.; Han, B.; Huang, L.-K.; Yao, J.; Liu, W.; and Rong, Y. 2025. Atomas: Hierarchical adaptive alignment on molecule-text for unified molecule understanding and generation. In The Thirteenth International Conference on Learning Representations