MolSight: A Graph-Aware Vision-Language Model for Unified Chemical Image Understanding

Wenda Wang; Yihan Tong; Yuwei Hu; Zhewei Wei

arxiv: 2607.01982 · v1 · pith:4AJNGBNMnew · submitted 2026-07-02 · 💻 cs.CV · cs.AI· q-bio.BM

MolSight: A Graph-Aware Vision-Language Model for Unified Chemical Image Understanding

Wenda Wang , Yihan Tong , Yuwei Hu , Zhewei Wei This is my paper

Pith reviewed 2026-07-03 15:38 UTC · model grok-4.3

classification 💻 cs.CV cs.AIq-bio.BM

keywords vision-language modelsmolecular imageschemical structure understandingmolecular topologygraph-aware modelsdrug discoveryimage reasoningmolecular design

0 comments

The pith

MolSight improves vision-language models for molecular images by injecting bond adjacency and aligning visuals to chemical symbols.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents MolSight to address how standard vision-language models miss structural details and topology when reading molecular drawings. It adds a Molecular Topology Module that feeds chemical-bond connections directly into the vision tokens and a Molecular Grounding Module that ties those tokens to the language of chemical formulas. Experiments show the combined system beats prior vision-language models, molecular language models, and domain-specific tools on several chemical image tasks. If the approach holds, systems could extract reliable structural and functional information straight from pictures instead of needing separate graph inputs.

Core claim

MolSight is a graph-aware vision-language model framework that integrates a Molecular Topology Module to inject chemical-bond adjacency information into vision tokens and a Molecular Grounding Module to align visual features with chemical symbolic semantics, resulting in stronger performance across molecular visual understanding tasks.

What carries the argument

Molecular Topology Module and Molecular Grounding Module, which embed bond adjacency into vision tokens and link those tokens to symbolic chemical semantics.

If this is right

Molecular images can be processed directly for structure and function without separate graph conversion steps.
Unified vision-language systems become viable for molecular design and drug discovery workflows.
Performance gains appear across multiple chemical visual tasks compared with prior models and tools.
A new baseline level of molecular image reasoning is established for subsequent work.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same module pattern could be tested on other image domains that contain implicit graphs, such as circuit diagrams or protein interaction maps.
Performance on noisy or hand-drawn molecular sketches from laboratory notebooks remains an open question for real deployment.
Combining the model with larger language models might enable end-to-end pipelines that both interpret an image and propose synthetic routes.

Load-bearing premise

That adding bond adjacency to vision tokens and linking visuals to chemical symbols will overcome the structural alignment and topological weaknesses of existing vision-language models.

What would settle it

A controlled test in which a standard vision-language model without the two modules matches or exceeds MolSight accuracy on the same set of molecular image benchmarks.

Figures

Figures reproduced from arXiv: 2607.01982 by Wenda Wang, Yihan Tong, Yuwei Hu, Zhewei Wei.

**Figure 1.** Figure 1: Overview of MolSight. a) Overall framework. Molecular image tokens and SVG text tokens are updated and integrated through molecular topology and grounding modules before LLM decoding. b) The Molecular Topology Module injects molecular topology into vision tokens, while the Molecular Grounding Module aligns vision tokens with SVG text annotations, jointly introducing graph topology messages into vision-toke… view at source ↗

**Figure 3.** Figure 3: Representative molecular captioning comparison [PITH_FULL_IMAGE:figures/full_fig_p013_3.png] view at source ↗

**Figure 2.** Figure 2: Representative molecular captioning comparison [PITH_FULL_IMAGE:figures/full_fig_p013_2.png] view at source ↗

**Figure 5.** Figure 5: Representative examples for the SMILES transla [PITH_FULL_IMAGE:figures/full_fig_p014_5.png] view at source ↗

**Figure 6.** Figure 6: Statistical analysis of MolSight’s molecular cap [PITH_FULL_IMAGE:figures/full_fig_p015_6.png] view at source ↗

**Figure 7.** Figure 7: Absolute error distributions for physicochemical descriptor prediction. Each subplot shows the distribution of absolute [PITH_FULL_IMAGE:figures/full_fig_p016_7.png] view at source ↗

**Figure 8.** Figure 8: Prediction V.S. ground-truth scatter plots for physicochemical descriptor prediction. Each subplot reports the Pearson [PITH_FULL_IMAGE:figures/full_fig_p019_8.png] view at source ↗

**Figure 9.** Figure 9: Representative cases of MolSight on SMILES translation tasks. [PITH_FULL_IMAGE:figures/full_fig_p020_9.png] view at source ↗

**Figure 10.** Figure 10: Representative cases of MolSight on Molecular Captioning tasks. The green values indicate predictions strongly [PITH_FULL_IMAGE:figures/full_fig_p021_10.png] view at source ↗

**Figure 11.** Figure 11: Representative cases of MolSight on descriptor estimation tasks. The green values indicate correct predictions, while [PITH_FULL_IMAGE:figures/full_fig_p022_11.png] view at source ↗

read the original abstract

Using molecular large language models (LLMs) as a unified framework for understanding molecular structures and functions is emerging as a new trend in tasks such as molecular design and drug discovery. However, these models struggle to fully capture the visual representation of molecular structures, limiting their potential. While existing molecular vision-language models (VLMs) show promise, they still face challenges in structural alignment and lack the necessary topological modeling for accurate molecular understanding. To address this, we propose MolSight, a graph-aware vision-language model framework designed to enhance the understanding of molecular images by VLMs. MolSight integrates a Molecular Topology Module to inject chemical-bond adjacency information into vision tokens, and a Molecular Grounding Module to align visual features with chemical symbolic semantics. Our experiments demonstrate that MolSight significantly outperforms existing VLMs, molecular LLMs, and specialized tools across multiple chemical visual understanding tasks, achieving a new level of molecular image reasoning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

MolSight adds a topology module for bond adjacency and a grounding module for symbol alignment to VLMs, but the abstract gives no data or experiments to check if it actually improves molecular image understanding.

read the letter

MolSight's main move is to insert a Molecular Topology Module that feeds chemical-bond adjacency into vision tokens and a Molecular Grounding Module that lines up visual features with chemical symbolic semantics. The goal is to fix the structural alignment and topological modeling shortfalls that standard VLMs show on molecular images.

The paper states the problem cleanly: molecular LLMs ignore visual structure while existing VLMs lack graph awareness. Adding explicit graph injection and grounding is a direct, incremental response to that gap and fits the chemistry setting where bond topology matters.

The clear limitation is the total absence of any experimental support. The abstract claims the model "significantly outperforms" VLMs, molecular LLMs, and specialized tools, yet supplies no datasets, metrics, baselines, ablations, or numbers. Without those, the performance claims cannot be checked for soundness or reproducibility. The stress-test note is right that no load-bearing technical flaw can be isolated because there is nothing concrete to examine.

This is for people already working on multimodal models for chemistry or drug discovery who want to see one way to bring graph structure into VLMs. A reader in that niche might look at the full methods if they exist, but the current version offers little to act on.

I would send it to peer review only if the full manuscript contains proper experiments, comparisons, and implementation details; on the abstract alone it is too thin.

Referee Report

1 major / 0 minor

Summary. The paper proposes MolSight, a graph-aware vision-language model for molecular image understanding. It integrates a Molecular Topology Module to inject chemical-bond adjacency information into vision tokens and a Molecular Grounding Module to align visual features with chemical symbolic semantics. Experiments are claimed to show that MolSight significantly outperforms existing VLMs, molecular LLMs, and specialized tools across multiple chemical visual understanding tasks.

Significance. If validated, the approach of explicitly injecting graph topology and grounding visual features to symbolic chemistry could address known limitations in VLMs for structural molecular reasoning, with potential impact on drug discovery and molecular design pipelines.

major comments (1)

Abstract: The central claim of significant outperformance is presented without any experimental details, datasets, metrics, baselines, error bars, or result tables, making it impossible to determine whether the performance claims are supported by the paper's own data or methods.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their feedback. The single major comment concerns the level of detail in the abstract; we address it directly below and note that the manuscript body contains the requested experimental information.

read point-by-point responses

Referee: Abstract: The central claim of significant outperformance is presented without any experimental details, datasets, metrics, baselines, error bars, or result tables, making it impossible to determine whether the performance claims are supported by the paper's own data or methods.

Authors: We agree that the abstract is written at a high level and omits specific experimental details. The full paper (Sections 4 and 5) specifies the datasets (e.g., PubChem, ChEMBL-derived molecular image sets), metrics (accuracy, F1, exact match on structure recognition and property prediction), baselines (standard VLMs, molecular LLMs, and specialized tools), and reports results with tables and error bars. To improve clarity we will revise the abstract to briefly name the primary evaluation tasks, key datasets, and the magnitude of gains while remaining within length limits. revision: partial

Circularity Check

0 steps flagged

No significant circularity

full rationale

The provided abstract and description introduce MolSight via two new modules (Molecular Topology Module and Molecular Grounding Module) whose purpose is stated descriptively, with performance claims resting on experiments. No equations, parameter-fitting steps, self-citations used as load-bearing uniqueness theorems, or renamings of prior results appear in the text. The derivation chain therefore contains no self-definitional reductions, fitted-input predictions, or imported ansatzes that collapse to the inputs by construction. The central claim remains an empirical assertion about outperformance rather than a closed logical loop.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies no information on free parameters, background axioms, or new postulated entities.

pith-pipeline@v0.9.1-grok · 5696 in / 1117 out tokens · 27441 ms · 2026-07-03T15:38:45.860085+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

75 extracted references · 13 canonical work pages · 8 internal anchors

[1]

Cell , volume=

A deep learning approach to antibiotic discovery , author=. Cell , volume=. 2020 , publisher=

2020
[2]

Nature , volume=

Scaling deep learning for materials discovery , author=. Nature , volume=. 2023 , publisher=

2023
[3]

SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules , author=. Journal of chemical information and computer sciences , volume=. 1988 , publisher=

1988
[5]

Nature , volume=

Autonomous chemical research with large language models , author=. Nature , volume=. 2023 , publisher=

2023
[6]

Digital Discovery , volume=

Comparing software tools for optical chemical structure recognition , author=. Digital Discovery , volume=. 2024 , publisher=

2024
[7]

Nature Communications , volume=

PatCID: an open-access dataset of chemical structures in patent documents , author=. Nature Communications , volume=. 2024 , publisher=

2024
[9]

International conference on machine learning , pages=

Learning transferable visual models from natural language supervision , author=. International conference on machine learning , pages=. 2021 , organization=

2021
[11]

Proceedings of the IEEE/CVF international conference on computer vision , pages=

Emerging properties in self-supervised vision transformers , author=. Proceedings of the IEEE/CVF international conference on computer vision , pages=
[12]

International conference on machine learning , pages=

Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models , author=. International conference on machine learning , pages=. 2023 , organization=

2023
[13]

Advances in neural information processing systems , volume=

Visual instruction tuning , author=. Advances in neural information processing systems , volume=
[15]

International conference on machine learning , pages=

Neural message passing for quantum chemistry , author=. International conference on machine learning , pages=. 2017 , organization=

2017
[17]

Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing , pages=

Translation between molecules and natural language , author=. Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing , pages=

2022
[18]

Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing , pages=

Biot5: Enriching cross-modal integration in biology with chemical knowledge and natural language associations , author=. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing , pages=

2023
[20]

Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing , pages=

Molca: Molecular graph-language modeling with cross-modal projector and uni-modal adapter , author=. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing , pages=

2023
[22]

The Thirteenth International Conference on Learning Representations , year=

Atomas: Hierarchical adaptive alignment on molecule-text for unified molecule understanding and generation , author=. The Thirteenth International Conference on Learning Representations , year=
[23]

Computers in biology and medicine , volume=

Git-mol: A multi-modal large language model for molecular science with graph, image, and text , author=. Computers in biology and medicine , volume=. 2024 , publisher=

2024
[24]

Proceedings of the AAAI Conference on Artificial Intelligence , volume=

Chemvlm: Exploring the power of multimodal large language models in chemistry area , author=. Proceedings of the AAAI Conference on Artificial Intelligence , volume=
[27]

Advances in neural information processing systems , volume=

Flamingo: a visual language model for few-shot learning , author=. Advances in neural information processing systems , volume=
[29]

Proceedings of the IEEE conference on computer vision and pattern recognition , pages=

Making the v in vqa matter: Elevating the role of image understanding in visual question answering , author=. Proceedings of the IEEE conference on computer vision and pattern recognition , pages=
[30]

European conference on computer vision , pages=

Microsoft coco: Common objects in context , author=. European conference on computer vision , pages=. 2014 , organization=

2014
[32]

2024 International Joint Conference on Neural Networks (IJCNN) , pages=

Zero-shot visual reasoning by vision-language models: Benchmarking and analysis , author=. 2024 International Joint Conference on Neural Networks (IJCNN) , pages=. 2024 , organization=

2024
[34]

InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

Internvl3. 5: Advancing open-source multimodal models in versatility, reasoning, and efficiency , author=. arXiv preprint arXiv:2508.18265 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[35]

Nucleic acids research , volume=

PubChem 2023 update , author=. Nucleic acids research , volume=. 2023 , publisher=

2023
[36]

Nature Machine Intelligence , volume=

Multi-modal molecule structure--text model for text-based retrieval and editing , author=. Nature Machine Intelligence , volume=. 2023 , publisher=

2023
[37]

Journal of chemical information and modeling , volume=

MolScribe: robust molecular structure recognition with image-to-graph generation , author=. Journal of chemical information and modeling , volume=. 2023 , publisher=

2023
[38]

Journal of Cheminformatics , volume=

DECIMER: towards deep learning for chemical image recognition , author=. Journal of Cheminformatics , volume=. 2020 , publisher=

2020
[39]

Findings of the Association for Computational Linguistics: EMNLP 2024 , pages=

Moleculeqa: A dataset to evaluate factual accuracy in molecular comprehension , author=. Findings of the Association for Computational Linguistics: EMNLP 2024 , pages=

2024
[40]

Proceedings of the AAAI Conference on Artificial Intelligence , volume=

RTMol: Rethinking Molecule-text Alignment in a Round-trip View , author=. Proceedings of the AAAI Conference on Artificial Intelligence , volume=
[41]

Chemical science , volume=

MoleculeNet: a benchmark for molecular machine learning , author=. Chemical science , volume=. 2018 , publisher=

2018
[42]

Advances in Neural Information Processing Systems , volume=

Molvision: Molecular property prediction with vision language models , author=. Advances in Neural Information Processing Systems , volume=
[43]

Nature Machine Intelligence , volume=

A quantitative analysis of knowledge-learning preferences in large language models in molecular science , author=. Nature Machine Intelligence , volume=. 2025 , publisher=

2025
[44]

International Conference on Learning Representations , volume=

Mol-instructions: A large-scale biomolecular instruction dataset for large language models , author=. International Conference on Learning Representations , volume=
[45]

GPT-4 Technical Report

Achiam, J.; Adler, S.; Agarwal, S.; Ahmad, L.; Akkaya, I.; Aleman, F. L.; Almeida, D.; Altenschmidt, J.; Altman, S.; Anadkat, S.; et al. 2023. Gpt-4 technical report. arXiv preprint arXiv:2303.08774

work page internal anchor Pith review Pith/arXiv arXiv 2023
[46]

Adak, D.; Rawat, Y.; and Vyas, S. 2026. Molvision: Molecular property prediction with vision language models. Advances in Neural Information Processing Systems 38

2026
[47]

Alayrac, J.-B.; Donahue, J.; Luc, P.; Miech, A.; Barr, I.; Hasson, Y.; Lenc, K.; Mensch, A.; Millican, K.; Reynolds, M.; et al. 2022. Flamingo: a visual language model for few-shot learning. Advances in neural information processing systems 35:23716--23736

2022
[48]

Bai, J.; Bai, S.; Chu, Y.; Cui, Z.; Dang, K.; Deng, X.; Fan, Y.; Ge, W.; Han, Y.; Huang, F.; et al. 2023. Qwen technical report. arXiv preprint arXiv:2309.16609

work page internal anchor Pith review Pith/arXiv arXiv 2023
[49]

Relational inductive biases, deep learning, and graph networks

Battaglia, P. W.; Hamrick, J. B.; Bapst, V.; Sanchez-Gonzalez, A.; Zambaldi, V.; Malinowski, M.; Tacchetti, A.; Raposo, D.; Santoro, A.; Faulkner, R.; et al. 2018. Relational inductive biases, deep learning, and graph networks. arXiv preprint arXiv:1806.01261

work page internal anchor Pith review Pith/arXiv arXiv 2018
[50]

A.; MacKnight, R.; Kline, B.; and Gomes, G

Boiko, D. A.; MacKnight, R.; Kline, B.; and Gomes, G. 2023. Autonomous chemical research with large language models. Nature 624(7992):570--578

2023
[51]

ChemCrow: Augmenting large-language models with chemistry tools

Bran, A. M.; Cox, S.; Schilter, O.; Baldassari, C.; White, A. D.; and Schwaller, P. 2023. Chemcrow: Augmenting large-language models with chemistry tools. arXiv preprint arXiv:2304.05376

work page internal anchor Pith review Pith/arXiv arXiv 2023
[52]

Caron, M.; Touvron, H.; Misra, I.; J \'e gou, H.; Mairal, J.; Bojanowski, P.; and Joulin, A. 2021. Emerging properties in self-supervised vision transformers. In Proceedings of the IEEE/CVF international conference on computer vision , 9650--9660

2021
[53]

Chen, Y.; Yao, Q.; Zhang, J.; Cheng, J.; and Bian, Y. 2024. Hight: Hierarchical graph tokenization for molecule-language alignment. arXiv preprint arXiv:2406.14021

work page arXiv 2024
[54]

Chen, L.; Shi, R.; Yu, G.; and Yang, Y. 2026. Rtmol: Rethinking molecule-text alignment in a round-trip view. In Proceedings of the AAAI Conference on Artificial Intelligence , volume 40, 75--82

2026
[55]

Cui, C.; Zhou, Y.; Yang, X.; Wu, S.; Zhang, L.; Zou, J.; and Yao, H. 2023. Holistic analysis of hallucination in gpt-4v (ision): Bias and interference challenges. arXiv preprint arXiv:2311.03287

work page arXiv 2023
[56]

Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. 2020. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929

work page internal anchor Pith review Pith/arXiv arXiv 2020
[57]

Edwards, C.; Lai, T.; Ros, K.; Honke, G.; Cho, K.; and Ji, H. 2022. Translation between molecules and natural language. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing , 375--413

2022
[58]

Fang, Y.; Liang, X.; Zhang, N.; Liu, K.; Huang, R.; Chen, Z.; Fan, X.; and Chen, H. 2024. Mol-instructions: A large-scale biomolecular instruction dataset for large language models. In International Conference on Learning Representations , volume 2024, 48221--48251

2024
[59]

S.; Riley, P

Gilmer, J.; Schoenholz, S. S.; Riley, P. F.; Vinyals, O.; and Dahl, G. E. 2017. Neural message passing for quantum chemistry. In International conference on machine learning , 1263--1272. Pmlr

2017
[60]

Goyal, Y.; Khot, T.; Summers-Stay, D.; Batra, D.; and Parikh, D. 2017. Making the v in vqa matter: Elevating the role of image understanding in visual question answering. In Proceedings of the IEEE conference on computer vision and pattern recognition , 6904--6913

2017
[61]

A.; Thiessen, P

Kim, S.; Chen, J.; Cheng, T.; Gindulyte, A.; He, J.; He, S.; Li, Q.; Shoemaker, B. A.; Thiessen, P. A.; Yu, B.; et al. 2023. Pubchem 2023 update. Nucleic acids research 51(D1):D1373--D1380

2023
[62]

Semi-Supervised Classification with Graph Convolutional Networks

Kipf, T. N., and Welling, M. 2016. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907

work page internal anchor Pith review Pith/arXiv arXiv 2016
[63]

J.; Boehme, T.; Boyer, S

Krasnov, A.; Barnabas, S. J.; Boehme, T.; Boyer, S. K.; and Weber, L. 2024. Comparing software tools for optical chemical structure recognition. Digital Discovery 3(4):681--693

2024
[64]

Li, J.; Li, D.; Savarese, S.; and Hoi, S. 2023. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In International conference on machine learning , 19730--19742. PMLR

2023
[65]

Li, J.; Zhang, D.; Wang, X.; Hao, Z.; Lei, J.; Tan, Q.; Zhou, C.; Liu, W.; Yang, Y.; Xiong, X.; et al. 2025. Chemvlm: Exploring the power of multimodal large language models in chemistry area. In Proceedings of the AAAI Conference on Artificial Intelligence , volume 39, 415--423

2025
[66]

Lin, T.-Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Doll \'a r, P.; and Zitnick, C. L. 2014. Microsoft coco: Common objects in context. In European conference on computer vision , 740--755. Springer

2014
[67]

Liu, H.; Li, C.; Wu, Q.; and Lee, Y. J. 2023a. Visual instruction tuning. Advances in neural information processing systems 36:34892--34916
[68]

Liu, S.; Nie, W.; Wang, C.; Lu, J.; Qiao, Z.; Liu, L.; Tang, J.; Xiao, C.; and Anandkumar, A. 2023b. Multi-modal molecule structure--text model for text-based retrieval and editing. Nature Machine Intelligence 5(12):1447--1457
[69]

Liu, Z.; Li, S.; Luo, Y.; Fei, H.; Cao, Y.; Kawaguchi, K.; Wang, X.; and Chua, T.-S. 2023c. Molca: Molecular graph-language modeling with cross-modal projector and uni-modal adapter. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing , 15623--15638

2023
[70]

Liu, P.; Ren, Y.; Tao, J.; and Ren, Z. 2024. Git-mol: A multi-modal large language model for molecular science with graph, image, and text. Computers in biology and medicine 171:108073

2024
[71]

Liu, P.; Tao, J.; and Ren, Z. 2025. A quantitative analysis of knowledge-learning preferences in large language models in molecular science. Nature Machine Intelligence 7(2):315--327

2025
[72]

Lu, X.; Cao, H.; Liu, Z.; Bai, S.; Chen, L.; Yao, Y.; Zheng, H.-T.; and Li, Y. 2024. Moleculeqa: A dataset to evaluate factual accuracy in molecular comprehension. In Findings of the Association for Computational Linguistics: EMNLP 2024 , 3769--3789

2024
[73]

S.; Aykol, M.; Cheon, G.; and Cubuk, E

Merchant, A.; Batzner, S.; Schoenholz, S. S.; Aykol, M.; Cheon, G.; and Cubuk, E. D. 2023. Scaling deep learning for materials discovery. Nature 624(7990):80--85

2023
[74]

I.; Yu, F.; and Staar, P

Morin, L.; Weber, V.; Meijer, G. I.; Yu, F.; and Staar, P. W. 2024. Patcid: an open-access dataset of chemical structures in patent documents. Nature Communications 15(1):6532

2024
[75]

Nagar, A.; Jaiswal, S.; and Tan, C. 2024. Zero-shot visual reasoning by vision-language models: Benchmarking and analysis. In 2024 International Joint Conference on Neural Networks (IJCNN) , 1--8. IEEE

2024
[76]

Pei, Q.; Zhang, W.; Zhu, J.; Wu, K.; Gao, K.; Wu, L.; Xia, Y.; and Yan, R. 2023. Biot5: Enriching cross-modal integration in biology with chemical knowledge and natural language associations. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing , 1102--1123

2023
[77]

W.; and Barzilay, R

Qian, Y.; Guo, J.; Tu, Z.; Li, Z.; Coley, C. W.; and Barzilay, R. 2023. Molscribe: robust molecular structure recognition with image-to-graph generation. Journal of chemical information and modeling 63(7):1925--1934

2023
[78]

W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al

Radford, A.; Kim, J. W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. 2021. Learning transferable visual models from natural language supervision. In International conference on machine learning , 8748--8763. PmLR

2021
[79]

Rajan, K.; Zielesny, A.; and Steinbeck, C. 2020. Decimer: towards deep learning for chemical image recognition. Journal of Cheminformatics 12(1):65

2020
[80]

M.; Yang, K.; Swanson, K.; Jin, W.; Cubillos-Ruiz, A.; Donghia, N

Stokes, J. M.; Yang, K.; Swanson, K.; Jin, W.; Cubillos-Ruiz, A.; Donghia, N. M.; MacNair, C. R.; French, S.; Carfrae, L. A.; Bloom-Ackermann, Z.; et al. 2020. A deep learning approach to antibiotic discovery. Cell 180(4):688--702

2020
[81]

Tan, Q.; Zhou, D.; Xia, P.; Liu, W.; Ouyang, W.; Bai, L.; Li, Y.; and Fu, T. 2025. Chemmllm: Chemical multimodal large language model. arXiv preprint arXiv:2505.16326

work page arXiv 2025
[82]

Veli c kovi \'c , P.; Cucurull, G.; Casanova, A.; Romero, A.; Lio, P.; and Bengio, Y. 2017. Graph attention networks. arXiv preprint arXiv:1710.10903

work page internal anchor Pith review Pith/arXiv arXiv 2017
[83]

Weininger, D. 1988. Smiles, a chemical language and information system. 1. introduction to methodology and encoding rules. Journal of chemical information and computer sciences 28(1):31--36

1988
[84]

N.; Gomes, J.; Geniesse, C.; Pappu, A

Wu, Z.; Ramsundar, B.; Feinberg, E. N.; Gomes, J.; Geniesse, C.; Pappu, A. S.; Leswing, K.; and Pande, V. 2018. Moleculenet: a benchmark for molecular machine learning. Chemical science 9(2):513--530

2018
[85]

Xu, W.; Wang, J.; Wang, W.; Chen, Z.; Zhou, W.; Yang, A.; Lu, L.; Li, H.; Wang, X.; Zhu, X.; et al. 2025. Visulogic: A benchmark for evaluating visual reasoning in multi-modal large language models. arXiv preprint arXiv:2504.15279

work page arXiv 2025
[86]

Zhang, D.; Liu, W.; Tan, Q.; Chen, J.; Yan, H.; Yan, Y.; Li, J.; Huang, W.; Yue, X.; Ouyang, W.; et al. 2024. Chemllm: A chemical large language model. arXiv preprint arXiv:2402.06852

work page arXiv 2024
[87]

Zhang, Y.; Ye, G.; Yuan, C.; Han, B.; Huang, L.-K.; Yao, J.; Liu, W.; and Rong, Y. 2025. Atomas: Hierarchical adaptive alignment on molecule-text for unified molecule understanding and generation. In The Thirteenth International Conference on Learning Representations

2025

[1] [1]

Cell , volume=

A deep learning approach to antibiotic discovery , author=. Cell , volume=. 2020 , publisher=

2020

[2] [2]

Nature , volume=

Scaling deep learning for materials discovery , author=. Nature , volume=. 2023 , publisher=

2023

[3] [3]

SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules , author=. Journal of chemical information and computer sciences , volume=. 1988 , publisher=

1988

[4] [5]

Nature , volume=

Autonomous chemical research with large language models , author=. Nature , volume=. 2023 , publisher=

2023

[5] [6]

Digital Discovery , volume=

Comparing software tools for optical chemical structure recognition , author=. Digital Discovery , volume=. 2024 , publisher=

2024

[6] [7]

Nature Communications , volume=

PatCID: an open-access dataset of chemical structures in patent documents , author=. Nature Communications , volume=. 2024 , publisher=

2024

[7] [9]

International conference on machine learning , pages=

Learning transferable visual models from natural language supervision , author=. International conference on machine learning , pages=. 2021 , organization=

2021

[8] [11]

Proceedings of the IEEE/CVF international conference on computer vision , pages=

Emerging properties in self-supervised vision transformers , author=. Proceedings of the IEEE/CVF international conference on computer vision , pages=

[9] [12]

International conference on machine learning , pages=

Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models , author=. International conference on machine learning , pages=. 2023 , organization=

2023

[10] [13]

Advances in neural information processing systems , volume=

Visual instruction tuning , author=. Advances in neural information processing systems , volume=

[11] [15]

International conference on machine learning , pages=

Neural message passing for quantum chemistry , author=. International conference on machine learning , pages=. 2017 , organization=

2017

[12] [17]

Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing , pages=

Translation between molecules and natural language , author=. Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing , pages=

2022

[13] [18]

Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing , pages=

Biot5: Enriching cross-modal integration in biology with chemical knowledge and natural language associations , author=. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing , pages=

2023

[14] [20]

Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing , pages=

Molca: Molecular graph-language modeling with cross-modal projector and uni-modal adapter , author=. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing , pages=

2023

[15] [22]

The Thirteenth International Conference on Learning Representations , year=

Atomas: Hierarchical adaptive alignment on molecule-text for unified molecule understanding and generation , author=. The Thirteenth International Conference on Learning Representations , year=

[16] [23]

Computers in biology and medicine , volume=

Git-mol: A multi-modal large language model for molecular science with graph, image, and text , author=. Computers in biology and medicine , volume=. 2024 , publisher=

2024

[17] [24]

Proceedings of the AAAI Conference on Artificial Intelligence , volume=

Chemvlm: Exploring the power of multimodal large language models in chemistry area , author=. Proceedings of the AAAI Conference on Artificial Intelligence , volume=

[18] [27]

Advances in neural information processing systems , volume=

Flamingo: a visual language model for few-shot learning , author=. Advances in neural information processing systems , volume=

[19] [29]

Proceedings of the IEEE conference on computer vision and pattern recognition , pages=

Making the v in vqa matter: Elevating the role of image understanding in visual question answering , author=. Proceedings of the IEEE conference on computer vision and pattern recognition , pages=

[20] [30]

European conference on computer vision , pages=

Microsoft coco: Common objects in context , author=. European conference on computer vision , pages=. 2014 , organization=

2014

[21] [32]

2024 International Joint Conference on Neural Networks (IJCNN) , pages=

Zero-shot visual reasoning by vision-language models: Benchmarking and analysis , author=. 2024 International Joint Conference on Neural Networks (IJCNN) , pages=. 2024 , organization=

2024

[22] [34]

InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

Internvl3. 5: Advancing open-source multimodal models in versatility, reasoning, and efficiency , author=. arXiv preprint arXiv:2508.18265 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[23] [35]

Nucleic acids research , volume=

PubChem 2023 update , author=. Nucleic acids research , volume=. 2023 , publisher=

2023

[24] [36]

Nature Machine Intelligence , volume=

Multi-modal molecule structure--text model for text-based retrieval and editing , author=. Nature Machine Intelligence , volume=. 2023 , publisher=

2023

[25] [37]

Journal of chemical information and modeling , volume=

MolScribe: robust molecular structure recognition with image-to-graph generation , author=. Journal of chemical information and modeling , volume=. 2023 , publisher=

2023

[26] [38]

Journal of Cheminformatics , volume=

DECIMER: towards deep learning for chemical image recognition , author=. Journal of Cheminformatics , volume=. 2020 , publisher=

2020

[27] [39]

Findings of the Association for Computational Linguistics: EMNLP 2024 , pages=

Moleculeqa: A dataset to evaluate factual accuracy in molecular comprehension , author=. Findings of the Association for Computational Linguistics: EMNLP 2024 , pages=

2024

[28] [40]

Proceedings of the AAAI Conference on Artificial Intelligence , volume=

RTMol: Rethinking Molecule-text Alignment in a Round-trip View , author=. Proceedings of the AAAI Conference on Artificial Intelligence , volume=

[29] [41]

Chemical science , volume=

MoleculeNet: a benchmark for molecular machine learning , author=. Chemical science , volume=. 2018 , publisher=

2018

[30] [42]

Advances in Neural Information Processing Systems , volume=

Molvision: Molecular property prediction with vision language models , author=. Advances in Neural Information Processing Systems , volume=

[31] [43]

Nature Machine Intelligence , volume=

A quantitative analysis of knowledge-learning preferences in large language models in molecular science , author=. Nature Machine Intelligence , volume=. 2025 , publisher=

2025

[32] [44]

International Conference on Learning Representations , volume=

Mol-instructions: A large-scale biomolecular instruction dataset for large language models , author=. International Conference on Learning Representations , volume=

[33] [45]

GPT-4 Technical Report

Achiam, J.; Adler, S.; Agarwal, S.; Ahmad, L.; Akkaya, I.; Aleman, F. L.; Almeida, D.; Altenschmidt, J.; Altman, S.; Anadkat, S.; et al. 2023. Gpt-4 technical report. arXiv preprint arXiv:2303.08774

work page internal anchor Pith review Pith/arXiv arXiv 2023

[34] [46]

Adak, D.; Rawat, Y.; and Vyas, S. 2026. Molvision: Molecular property prediction with vision language models. Advances in Neural Information Processing Systems 38

2026

[35] [47]

Alayrac, J.-B.; Donahue, J.; Luc, P.; Miech, A.; Barr, I.; Hasson, Y.; Lenc, K.; Mensch, A.; Millican, K.; Reynolds, M.; et al. 2022. Flamingo: a visual language model for few-shot learning. Advances in neural information processing systems 35:23716--23736

2022

[36] [48]

Bai, J.; Bai, S.; Chu, Y.; Cui, Z.; Dang, K.; Deng, X.; Fan, Y.; Ge, W.; Han, Y.; Huang, F.; et al. 2023. Qwen technical report. arXiv preprint arXiv:2309.16609

work page internal anchor Pith review Pith/arXiv arXiv 2023

[37] [49]

Relational inductive biases, deep learning, and graph networks

Battaglia, P. W.; Hamrick, J. B.; Bapst, V.; Sanchez-Gonzalez, A.; Zambaldi, V.; Malinowski, M.; Tacchetti, A.; Raposo, D.; Santoro, A.; Faulkner, R.; et al. 2018. Relational inductive biases, deep learning, and graph networks. arXiv preprint arXiv:1806.01261

work page internal anchor Pith review Pith/arXiv arXiv 2018

[38] [50]

A.; MacKnight, R.; Kline, B.; and Gomes, G

Boiko, D. A.; MacKnight, R.; Kline, B.; and Gomes, G. 2023. Autonomous chemical research with large language models. Nature 624(7992):570--578

2023

[39] [51]

ChemCrow: Augmenting large-language models with chemistry tools

Bran, A. M.; Cox, S.; Schilter, O.; Baldassari, C.; White, A. D.; and Schwaller, P. 2023. Chemcrow: Augmenting large-language models with chemistry tools. arXiv preprint arXiv:2304.05376

work page internal anchor Pith review Pith/arXiv arXiv 2023

[40] [52]

Caron, M.; Touvron, H.; Misra, I.; J \'e gou, H.; Mairal, J.; Bojanowski, P.; and Joulin, A. 2021. Emerging properties in self-supervised vision transformers. In Proceedings of the IEEE/CVF international conference on computer vision , 9650--9660

2021

[41] [53]

Chen, Y.; Yao, Q.; Zhang, J.; Cheng, J.; and Bian, Y. 2024. Hight: Hierarchical graph tokenization for molecule-language alignment. arXiv preprint arXiv:2406.14021

work page arXiv 2024

[42] [54]

Chen, L.; Shi, R.; Yu, G.; and Yang, Y. 2026. Rtmol: Rethinking molecule-text alignment in a round-trip view. In Proceedings of the AAAI Conference on Artificial Intelligence , volume 40, 75--82

2026

[43] [55]

Cui, C.; Zhou, Y.; Yang, X.; Wu, S.; Zhang, L.; Zou, J.; and Yao, H. 2023. Holistic analysis of hallucination in gpt-4v (ision): Bias and interference challenges. arXiv preprint arXiv:2311.03287

work page arXiv 2023

[44] [56]

Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. 2020. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929

work page internal anchor Pith review Pith/arXiv arXiv 2020

[45] [57]

Edwards, C.; Lai, T.; Ros, K.; Honke, G.; Cho, K.; and Ji, H. 2022. Translation between molecules and natural language. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing , 375--413

2022

[46] [58]

Fang, Y.; Liang, X.; Zhang, N.; Liu, K.; Huang, R.; Chen, Z.; Fan, X.; and Chen, H. 2024. Mol-instructions: A large-scale biomolecular instruction dataset for large language models. In International Conference on Learning Representations , volume 2024, 48221--48251

2024

[47] [59]

S.; Riley, P

Gilmer, J.; Schoenholz, S. S.; Riley, P. F.; Vinyals, O.; and Dahl, G. E. 2017. Neural message passing for quantum chemistry. In International conference on machine learning , 1263--1272. Pmlr

2017

[48] [60]

Goyal, Y.; Khot, T.; Summers-Stay, D.; Batra, D.; and Parikh, D. 2017. Making the v in vqa matter: Elevating the role of image understanding in visual question answering. In Proceedings of the IEEE conference on computer vision and pattern recognition , 6904--6913

2017

[49] [61]

A.; Thiessen, P

Kim, S.; Chen, J.; Cheng, T.; Gindulyte, A.; He, J.; He, S.; Li, Q.; Shoemaker, B. A.; Thiessen, P. A.; Yu, B.; et al. 2023. Pubchem 2023 update. Nucleic acids research 51(D1):D1373--D1380

2023

[50] [62]

Semi-Supervised Classification with Graph Convolutional Networks

Kipf, T. N., and Welling, M. 2016. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907

work page internal anchor Pith review Pith/arXiv arXiv 2016

[51] [63]

J.; Boehme, T.; Boyer, S

Krasnov, A.; Barnabas, S. J.; Boehme, T.; Boyer, S. K.; and Weber, L. 2024. Comparing software tools for optical chemical structure recognition. Digital Discovery 3(4):681--693

2024

[52] [64]

Li, J.; Li, D.; Savarese, S.; and Hoi, S. 2023. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In International conference on machine learning , 19730--19742. PMLR

2023

[53] [65]

Li, J.; Zhang, D.; Wang, X.; Hao, Z.; Lei, J.; Tan, Q.; Zhou, C.; Liu, W.; Yang, Y.; Xiong, X.; et al. 2025. Chemvlm: Exploring the power of multimodal large language models in chemistry area. In Proceedings of the AAAI Conference on Artificial Intelligence , volume 39, 415--423

2025

[54] [66]

Lin, T.-Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Doll \'a r, P.; and Zitnick, C. L. 2014. Microsoft coco: Common objects in context. In European conference on computer vision , 740--755. Springer

2014

[55] [67]

Liu, H.; Li, C.; Wu, Q.; and Lee, Y. J. 2023a. Visual instruction tuning. Advances in neural information processing systems 36:34892--34916

[56] [68]

Liu, S.; Nie, W.; Wang, C.; Lu, J.; Qiao, Z.; Liu, L.; Tang, J.; Xiao, C.; and Anandkumar, A. 2023b. Multi-modal molecule structure--text model for text-based retrieval and editing. Nature Machine Intelligence 5(12):1447--1457

[57] [69]

Liu, Z.; Li, S.; Luo, Y.; Fei, H.; Cao, Y.; Kawaguchi, K.; Wang, X.; and Chua, T.-S. 2023c. Molca: Molecular graph-language modeling with cross-modal projector and uni-modal adapter. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing , 15623--15638

2023

[58] [70]

Liu, P.; Ren, Y.; Tao, J.; and Ren, Z. 2024. Git-mol: A multi-modal large language model for molecular science with graph, image, and text. Computers in biology and medicine 171:108073

2024

[59] [71]

Liu, P.; Tao, J.; and Ren, Z. 2025. A quantitative analysis of knowledge-learning preferences in large language models in molecular science. Nature Machine Intelligence 7(2):315--327

2025

[60] [72]

Lu, X.; Cao, H.; Liu, Z.; Bai, S.; Chen, L.; Yao, Y.; Zheng, H.-T.; and Li, Y. 2024. Moleculeqa: A dataset to evaluate factual accuracy in molecular comprehension. In Findings of the Association for Computational Linguistics: EMNLP 2024 , 3769--3789

2024

[61] [73]

S.; Aykol, M.; Cheon, G.; and Cubuk, E

Merchant, A.; Batzner, S.; Schoenholz, S. S.; Aykol, M.; Cheon, G.; and Cubuk, E. D. 2023. Scaling deep learning for materials discovery. Nature 624(7990):80--85

2023

[62] [74]

I.; Yu, F.; and Staar, P

Morin, L.; Weber, V.; Meijer, G. I.; Yu, F.; and Staar, P. W. 2024. Patcid: an open-access dataset of chemical structures in patent documents. Nature Communications 15(1):6532

2024

[63] [75]

Nagar, A.; Jaiswal, S.; and Tan, C. 2024. Zero-shot visual reasoning by vision-language models: Benchmarking and analysis. In 2024 International Joint Conference on Neural Networks (IJCNN) , 1--8. IEEE

2024

[64] [76]

Pei, Q.; Zhang, W.; Zhu, J.; Wu, K.; Gao, K.; Wu, L.; Xia, Y.; and Yan, R. 2023. Biot5: Enriching cross-modal integration in biology with chemical knowledge and natural language associations. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing , 1102--1123

2023

[65] [77]

W.; and Barzilay, R

Qian, Y.; Guo, J.; Tu, Z.; Li, Z.; Coley, C. W.; and Barzilay, R. 2023. Molscribe: robust molecular structure recognition with image-to-graph generation. Journal of chemical information and modeling 63(7):1925--1934

2023

[66] [78]

W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al

Radford, A.; Kim, J. W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. 2021. Learning transferable visual models from natural language supervision. In International conference on machine learning , 8748--8763. PmLR

2021

[67] [79]

Rajan, K.; Zielesny, A.; and Steinbeck, C. 2020. Decimer: towards deep learning for chemical image recognition. Journal of Cheminformatics 12(1):65

2020

[68] [80]

M.; Yang, K.; Swanson, K.; Jin, W.; Cubillos-Ruiz, A.; Donghia, N

Stokes, J. M.; Yang, K.; Swanson, K.; Jin, W.; Cubillos-Ruiz, A.; Donghia, N. M.; MacNair, C. R.; French, S.; Carfrae, L. A.; Bloom-Ackermann, Z.; et al. 2020. A deep learning approach to antibiotic discovery. Cell 180(4):688--702

2020

[69] [81]

Tan, Q.; Zhou, D.; Xia, P.; Liu, W.; Ouyang, W.; Bai, L.; Li, Y.; and Fu, T. 2025. Chemmllm: Chemical multimodal large language model. arXiv preprint arXiv:2505.16326

work page arXiv 2025

[70] [82]

Veli c kovi \'c , P.; Cucurull, G.; Casanova, A.; Romero, A.; Lio, P.; and Bengio, Y. 2017. Graph attention networks. arXiv preprint arXiv:1710.10903

work page internal anchor Pith review Pith/arXiv arXiv 2017

[71] [83]

Weininger, D. 1988. Smiles, a chemical language and information system. 1. introduction to methodology and encoding rules. Journal of chemical information and computer sciences 28(1):31--36

1988

[72] [84]

N.; Gomes, J.; Geniesse, C.; Pappu, A

Wu, Z.; Ramsundar, B.; Feinberg, E. N.; Gomes, J.; Geniesse, C.; Pappu, A. S.; Leswing, K.; and Pande, V. 2018. Moleculenet: a benchmark for molecular machine learning. Chemical science 9(2):513--530

2018

[73] [85]

Xu, W.; Wang, J.; Wang, W.; Chen, Z.; Zhou, W.; Yang, A.; Lu, L.; Li, H.; Wang, X.; Zhu, X.; et al. 2025. Visulogic: A benchmark for evaluating visual reasoning in multi-modal large language models. arXiv preprint arXiv:2504.15279

work page arXiv 2025

[74] [86]

Zhang, D.; Liu, W.; Tan, Q.; Chen, J.; Yan, H.; Yan, Y.; Li, J.; Huang, W.; Yue, X.; Ouyang, W.; et al. 2024. Chemllm: A chemical large language model. arXiv preprint arXiv:2402.06852

work page arXiv 2024

[75] [87]

Zhang, Y.; Ye, G.; Yuan, C.; Han, B.; Huang, L.-K.; Yao, J.; Liu, W.; and Rong, Y. 2025. Atomas: Hierarchical adaptive alignment on molecule-text for unified molecule understanding and generation. In The Thirteenth International Conference on Learning Representations

2025