pith. machine review for the scientific record.

arxiv: 2605.10157 · v1 · submitted 2026-05-11 · 💻 cs.CV · cs.CL

Recognition: 2 Lean theorem links

MolSight: Molecular Property Prediction with Images

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 04:01 UTC · model grok-4.3

classification 💻 cs.CV cs.CL
keywords molecular property prediction · vision-based models · curriculum learning · 2D molecular images · bond-line diagrams · drug discovery · quantum chemistry

The pith

A single 2D bond-line image processed by a vision encoder is sufficient for competitive molecular property prediction.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that standard vision models can extract enough chemical information from rendered 2D skeletal diagrams to predict molecular properties accurately across physical, drug-discovery, and quantum tasks. This sidesteps the usual requirements for building explicit graphs, generating 3D conformers, or training billion-parameter language models. A curriculum that trains the model on molecules ordered by increasing structural complexity further improves results over flat training. If the approach holds, property prediction becomes feasible with ordinary image pipelines and dramatically lower compute, opening the method to labs without specialized molecular software.

Core claim

Using ten vision architectures and two million pre-training images, the work shows that a vision encoder applied to a single rendered bond-line diagram achieves top or near-top performance on ten downstream benchmarks. The chemistry-informed curriculum divides pre-training molecules into five tiers using structural complexity descriptors and consistently outperforms non-curriculum training. The strongest configuration ranks first on five tasks and in the top two on all ten while requiring eighty times fewer FLOPs than the nearest multimodal competitor.

What carries the argument

Vision encoder on rendered bond-line images, trained with a curriculum that partitions molecules into five tiers of increasing structural complexity via five descriptors.
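The tier construction can be sketched with a toy quantile partition over a single combined complexity score. This is a minimal illustration, not the paper's pipeline: the molecule ids and scores below are placeholders, and in practice the five descriptors would come from a cheminformatics toolkit rather than hand-assigned numbers.

```python
def assign_tiers(scores, n_tiers=5):
    """Partition molecules into n_tiers of increasing structural complexity.

    scores: list of (mol_id, complexity_score) pairs, where the score stands
    in for the paper's combined five-descriptor measure (an assumption here).
    Returns a dict mol_id -> tier index (0 = simplest).
    """
    ranked = sorted(scores, key=lambda p: p[1])       # ascending complexity
    per_tier = -(-len(ranked) // n_tiers)             # ceil division
    return {mol_id: i // per_tier for i, (mol_id, _) in enumerate(ranked)}

# toy complexity scores for six molecules
tiers = assign_tiers([("a", 1.0), ("b", 9.0), ("c", 3.0),
                      ("d", 7.0), ("e", 5.0), ("f", 2.0)])
```

With this toy input the simplest pair ("a", "f") lands in the lowest tier and the most complex pair ("b", "d") in the highest occupied one; on a real corpus each quantile tier would hold roughly a fifth of the molecules.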

If this is right

  • Standard image classification pipelines can match or exceed graph and multimodal methods on physical-property regression, drug classification, and quantum prediction.
  • Ordering pre-training data by structural complexity descriptors yields consistent gains over random or flat training schedules.
  • Molecular property prediction becomes viable at eighty times lower computational cost than current leading multimodal systems.
  • Two-dimensional skeletal representations encode sufficient information for competitive accuracy across the tested physical, drug-discovery, and quantum tasks.
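The curriculum-versus-flat contrast above can be made concrete with a toy scheduler. The tier lists and the concatenate-tiers-in-order schedule are illustrative assumptions; the paper's actual scheduler may interleave or anneal tiers rather than simply exhausting them in sequence.

```python
import random

def curriculum_order(tiers):
    """tiers[k] = ids of molecules in complexity tier k (0 = simplest)."""
    # Chemistry-informed schedule: exhaust Tier 0 before Tier 1, and so on.
    return [mol for tier in tiers for mol in tier]

def flat_order(tiers, seed=0):
    """Complexity-agnostic baseline: same molecules, shuffled uniformly."""
    pool = [mol for tier in tiers for mol in tier]
    random.Random(seed).shuffle(pool)
    return pool

# hypothetical three-tier corpus
toy_tiers = [["ethanol", "benzene"], ["aspirin"], ["taxol"]]
```

Both schedules see the same molecules under the same compute budget; only the presentation order differs, which is exactly the variable the curriculum claim isolates.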

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The method could allow quick property estimates from hand-drawn sketches in laboratory settings where rendering software is unavailable.
  • Learned visual patterns in diagrams may implicitly capture some three-dimensional or electronic features that are not explicitly drawn.
  • The curriculum approach might transfer to other scientific image domains where objects vary widely in structural complexity.

Load-bearing premise

Two-dimensional bond-line diagrams contain enough chemical information to support accurate property prediction without explicit three-dimensional geometry or graph connectivity.

What would settle it

A curriculum-trained vision model that underperforms graph-based baselines on quantum-chemistry tasks known to depend on three-dimensional conformation would falsify the sufficiency claim.

Figures

Figures reproduced from arXiv: 2605.10157 by Aaditya Baranwal, Akshaj Gupta, Shruti Vyas, Yogesh S Rawat.

Figure 1. The MolSight study: architectures, strategies, and downstream …
Figure 2. Example input samples from Tier 0 to Tier 4 of the pre-training datasets.
Figure 3. Pre-training corpus descriptor distributions for MolTextNet-1M and …
Figure 4. Pre-training convergence with and without curriculum scheduling.
Figure 5. Curriculum tier (CT) characterisation: (a) tier distribution (molecule counts) for MolTextNet-1M and PCQM4Mv2; PCQM4Mv2 is dominated by Tiers 0–2 (simple quantum-chemistry molecules), while MolTextNet harbours a richer Tier 3–4 population (complex drug-like structures). (b) Boxplots of Bertz complexity per tier for MolTextNet-1M, confirming a monotone, well-separated complexity increase across tiers. (c) …
Figure 6. Classification performance vs. GNN pre-training baselines (ROC-AUC %). MolSight (deep rose) achieves the highest AUC on 4 of 5 tasks, competitive with graph-based methods that operate on explicit molecular topology; Uni-Mol† (3D conformers) leads only on ClinTox.
Figure 8. Progressive improvement across pre-training strategies (SigLIP2).
Figure 10. Regression performance; legend shared with …
Figure 11. t-SNE of QM9 validation embeddings: S6 representations align smoothly with atomization energy (U0), whereas S3 exhibits coarse clustering.
Figure 12. Last-layer attention comparison (S3 vs. S6).
Figure 13. Evolution of the embedding space for the QM9 HOMO–LUMO gap.
Figure 14. OpenCLIP t-SNE embeddings across the supervision ladder.
read the original abstract

Every molecule ever synthesised can be drawn as a 2D skeletal diagram, yet in modern property prediction this universally available representation has received less focus in favour of molecular graphs, 3D conformers, or billion-parameter language models, each imposing its own computational and data-engineering overhead. We present $\textbf{MolSight}$, the first systematic large-scale study of vision-based Molecular Property Prediction (MPP). Using 10 vision architectures, 7 pre-training strategies, and $2\,M$ molecule images, we evaluate performance across 10 downstream tasks spanning physical-property regression, drug-discovery classification, and quantum-chemistry prediction. To account for the wide variation in structural complexity across pre-training molecules, we further propose a $\textbf{chemistry-informed curriculum}$: five structural complexity descriptors partition the corpus into five tiers of increasing chemical difficulty, consistently outperforming non-curriculum baselines. We show that a single rendered bond-line image, processed by a vision encoder, is sufficient for competitive molecular property prediction, i.e. $\textit{chemical insight from sight alone}$. The best curriculum-trained configuration achieves the top result on $\textbf{5 of 10}$ benchmarks and top two on $\textbf{all 10}$, at $80\times$ lower FLOPs than the nearest multi-modal competitor.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it: the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces MolSight, the first large-scale study of vision-based molecular property prediction using rendered 2D bond-line diagrams as input to standard vision encoders. It evaluates 10 architectures and 7 pre-training strategies on 2M molecule images across 10 downstream tasks (physical-property regression, drug-discovery classification, quantum-chemistry prediction). A chemistry-informed curriculum is proposed that partitions the pre-training corpus into five tiers using structural complexity descriptors; the best curriculum-trained model reports top-1 results on 5 of 10 benchmarks and top-2 on all 10, at 80× lower FLOPs than the nearest multi-modal competitor. The central claim is that a single 2D skeletal image processed by a vision encoder is sufficient for competitive chemical property prediction.

Significance. If the empirical results hold after addressing the noted gaps, the work would establish that 2D bond-line renderings alone can deliver competitive performance across diverse MPP tasks, offering a simpler and far more efficient alternative to graph neural networks, 3D conformer models, or large language models. The systematic comparison of 10 vision backbones and the curriculum-learning strategy constitute clear contributions; the 80× FLOPs reduction is a practically important finding if the baseline comparison is fully documented.

major comments (3)
  1. [§4, Table 5] §4 (Quantum-chemistry benchmarks) and Table 5: the top-two results on QM9-derived tasks (atomization energy, dipole moment, HOMO-LUMO gap) are presented without any ablation that isolates 3D dependence. No experiments compare performance on fixed versus randomized conformers, on stereochemically explicit versus non-explicit renderings, or on 2D-topology-only subsets; without these controls it remains unclear whether the vision encoder recovers genuine 3D information or merely exploits dataset-specific 2D–property correlations.
  2. [§3.2] §3.2 (Curriculum construction): the five structural-complexity descriptors are used to define tiers, yet no quantitative validation is given that this ordering improves generalization beyond the specific pre-training corpus (e.g., no cross-corpus transfer experiment or comparison against random ordering with matched compute). The claim that the curriculum is “chemistry-informed” therefore rests on the descriptors alone rather than on demonstrated causal benefit.
  3. [§5] §5 (Efficiency comparison): the 80× FLOPs reduction versus the nearest multi-modal competitor is stated without an explicit table listing the competitor’s architecture, input resolution, and exact FLOPs calculation; the factor cannot be reproduced from the given information and is load-bearing for the efficiency claim.
minor comments (2)
  1. [Figure 2] Figure 2 (curriculum tier examples): the rendered images are too small to verify that the five complexity descriptors are visually distinguishable; larger insets or additional examples would improve clarity.
  2. [Table 1] Table 1 (benchmark summary): the column headers for metric type (MAE vs. ROC-AUC) are not repeated on every page of the table; readers must scroll to confirm units.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major point below, providing clarifications where the manuscript already supports the claims and outlining targeted revisions to strengthen the empirical support.

read point-by-point responses
  1. Referee: [§4, Table 5] §4 (Quantum-chemistry benchmarks) and Table 5: the top-two results on QM9-derived tasks (atomization energy, dipole moment, HOMO-LUMO gap) are presented without any ablation that isolates 3D dependence. No experiments compare performance on fixed versus randomized conformers, on stereochemically explicit versus non-explicit renderings, or on 2D-topology-only subsets; without these controls it remains unclear whether the vision encoder recovers genuine 3D information or merely exploits dataset-specific 2D–property correlations.

    Authors: We agree that explicit controls would strengthen the interpretation. Because every input is a 2D bond-line rendering with no 3D coordinates or conformer data provided to the model, the encoder has no mechanism to recover genuine 3D geometry; predictive accuracy on QM9 tasks must therefore derive from 2D topological and visual features that happen to correlate with the 3D-derived labels. We will revise §4 to state this limitation explicitly and add a focused ablation comparing performance on renderings that include versus omit stereochemical indicators (wedge/dash bonds) for the relevant QM9 subsets. Full randomization of conformers is not applicable to our 2D pipeline, but we will note this as a natural direction for future 3D-aware rendering studies. revision: partial

  2. Referee: [§3.2] §3.2 (Curriculum construction): the five structural-complexity descriptors are used to define tiers, yet no quantitative validation is given that this ordering improves generalization beyond the specific pre-training corpus (e.g., no cross-corpus transfer experiment or comparison against random ordering with matched compute). The claim that the curriculum is “chemistry-informed” therefore rests on the descriptors alone rather than on demonstrated causal benefit.

    Authors: The five descriptors are standard cheminformatics measures of structural complexity and were chosen precisely because they reflect increasing chemical difficulty. To supply the requested empirical validation, we will add an ablation in the revised §3.2 that trains an otherwise identical model using a random tier ordering under the same total compute budget and reports the resulting downstream performance gap, thereby demonstrating the causal benefit of the chemistry-informed schedule. revision: yes

  3. Referee: [§5] §5 (Efficiency comparison): the 80× FLOPs reduction versus the nearest multi-modal competitor is stated without an explicit table listing the competitor’s architecture, input resolution, and exact FLOPs calculation; the factor cannot be reproduced from the given information and is load-bearing for the efficiency claim.

    Authors: We apologize for the missing detail. In the revised §5 we will insert a dedicated table that enumerates the competitor architecture, input resolution, batch-size assumptions, and the exact FLOPs computation (including any hardware-specific scaling factors) used to obtain the reported 80× reduction, making the comparison fully reproducible. revision: yes
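The reproducibility concern is easy to appreciate, because even a first-order FLOPs estimate depends on architecture and token count. The sketch below counts only the attention and MLP matrix multiplies of a ViT-style encoder at 2 FLOPs per multiply-accumulate; the ViT-B/16-like numbers in the example are illustrative assumptions, not the paper's configurations, and do not reproduce the reported 80× figure.

```python
def vit_forward_flops(layers, d_model, n_tokens, mlp_ratio=4):
    """First-order forward-pass FLOPs for a ViT-style encoder.

    Counts only the dominant matmuls: QKV + output projections,
    attention scores and mixing, and the two MLP layers. Patch
    embedding, norms, and softmax are ignored.
    """
    attn_macs = 4 * n_tokens * d_model**2 + 2 * n_tokens**2 * d_model
    mlp_macs = 2 * n_tokens * d_model * (mlp_ratio * d_model)
    return 2 * layers * (attn_macs + mlp_macs)  # 2 FLOPs per MAC

# e.g. a ViT-B/16-like encoder on a 224x224 image (196 patch tokens)
flops = vit_forward_flops(layers=12, d_model=768, n_tokens=196)
```

The result, roughly 3.5e10 FLOPs per image (about 17 GMACs, in line with commonly quoted ViT-B/16 figures), shows why the competitor's input resolution and token count must be stated for an 80× ratio to be checkable.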

Circularity Check

0 steps flagged

No circularity: purely empirical benchmark comparison

full rationale

The paper reports an empirical study training vision encoders on rendered 2D molecular images and measuring performance on 10 external downstream benchmarks. No mathematical derivations, parameter fits presented as predictions, or self-citation load-bearing steps appear in the provided text. All claims reduce to measured accuracy/FLOPs on held-out tasks rather than any quantity defined in terms of the model's own outputs or prior self-referential results.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 1 invented entity

The central claim rests on the empirical effectiveness of vision models on rendered 2D images and the utility of the proposed curriculum; it inherits standard supervised learning assumptions and the domain premise that 2D drawings suffice for the chosen properties.

free parameters (1)
  • number of curriculum tiers
    Five tiers chosen to partition pre-training molecules by structural complexity descriptors.
axioms (2)
  • domain assumption: Rendered 2D bond-line images contain sufficient chemical information for the target properties.
    Invoked when claiming that sight alone is sufficient.
  • ad hoc to paper: Curriculum ordering by structural complexity improves generalization over random ordering.
    Proposed and tested in the paper without prior theoretical guarantee.
invented entities (1)
  • chemistry-informed curriculum (no independent evidence)
    purpose: partition the pre-training corpus into five tiers of increasing structural difficulty
    New training strategy introduced to improve vision-model performance on molecules.

pith-pipeline@v0.9.0 · 5541 in / 1495 out tokens · 49453 ms · 2026-05-12T04:01:32.682556+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

39 extracted references · 39 canonical work pages · 2 internal anchors

  1. Adak, D., Rawat, Y.S., Vyas, S.: MolVision: Molecular property prediction with vision language models. In: Advances in Neural Information Processing Systems (NeurIPS) Datasets and Benchmarks Track (2025)
  2. Aytar, Y., Vondrick, C., Torralba, A.: SoundNet: Learning sound representations from unlabeled video. In: Advances in Neural Information Processing Systems (NeurIPS), vol. 29 (2016)
  3. Baranwal, A., Vyas, S.: ChemPro: A progressive chemistry benchmark for large language models. Artificial Intelligence Chemistry 4(1), 100118 (2026). https://doi.org/10.1016/j.aichem.2026.100118
  4. Bengio, Y., Louradour, J., Collobert, R., Weston, J.: Curriculum learning. In: Proceedings of the 26th International Conference on Machine Learning, ICML 2009, pp. 41–48. ACM (2009)
  5. Bertz, S.H.: The first general index of molecular complexity. Journal of the American Chemical Society 103(12), 3599–3601 (1981)
  6. Chen, T., Kornblith, S., Norouzi, M., Hinton, G.E.: A simple framework for contrastive learning of visual representations. In: Proceedings of the 37th International Conference on Machine Learning, ICML 2020. Proceedings of Machine Learning Research, vol. 119, pp. 1597–1607. PMLR (2020)
  7. Chen, X., Wu, Z., Liu, X., Pan, Z., Liu, W., Xie, Z., Yu, X., Ruan, C.: Janus-Pro: Unified multimodal understanding and generation with data and model scaling. arXiv preprint arXiv:2501.17811 (2025)
  8. Chithrananda, S., Grand, G., Ramsundar, B.: ChemBERTa: Large-scale self-supervised pretraining for molecular property prediction. In: Machine Learning for Molecules Workshop at NeurIPS 2020 (2020)
  9. Delaney, J.S.: ESOL: Estimating aqueous solubility directly from molecular structure. Journal of Chemical Information and Computer Sciences 44(3), 1000–1005 (2004)
  10. Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). https://doi.org/10.1109/CVPR.2009.5206848
  11. Edwards, C., Lai, T., Ros, K., Honke, G., Cho, K., Ji, H.: Translation between molecules and natural language. In: Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 375–413 (2022). https://doi.org/10.18653/v1/2022.emnlp-main.26
  12. Ertl, P., Schuffenhauer, A.: Estimation of synthetic accessibility score of drug-like molecules based on molecular complexity and fragment contributions. Journal of Cheminformatics 1(1), 8 (2009)
  13. Gilmer, J., Schoenholz, S.S., Riley, P.F., Vinyals, O., Dahl, G.E.: Neural message passing for quantum chemistry. In: Proceedings of the 34th International Conference on Machine Learning, ICML 2017. Proceedings of Machine Learning Research, vol. 70, pp. 1263–1272. PMLR (2017)
  14. Huo, F., Xu, W., Guo, J., Wang, H., Guo, S.: C2KD: Bridging the modality gap for cross-modal knowledge distillation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 16006–16015 (2024)
  15. Ji, Y., Chen, Y., Yang, L., Ding, R., Yang, M., Zheng, X.: VeXKD: The versatile integration of cross-modal fusion and knowledge distillation for 3D perception. In: Advances in Neural Information Processing Systems (NeurIPS) (2024)
  16. Kim, S., Xiao, R., Georgescu, M.I., Alaniz, S., Akata, Z.: COSMOS: Cross-modality self-distillation for vision language pre-training. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2025)
  17. Krenn, M., Ai, Q., Barthel, S., Carson, N., Falk, A., Duber, F., Harper, P., Jost, M., Magar, R., Rose, K., et al.: SELFIES and the future of molecular string representations. Patterns 3(10), 100588 (2022)
  18. Liu, S., Du, W., Ma, Z.M., Guo, H., Tang, J.: A group symmetric stochastic differential equation model for molecule multi-modal pretraining. In: Proceedings of the 40th International Conference on Machine Learning (ICML). Proceedings of Machine Learning Research, vol. 202 (2023)
  19. Liu, S., Wang, H., Liu, W., Lasenby, J., Guo, H., Tang, J.: Pre-training molecular graph representation with 3D geometry. In: International Conference on Learning Representations (ICLR) (2022)
  20. Lu, S., Gao, Z., He, D., Zhang, L., Ke, G.: Data-driven quantum chemical property prediction leveraging 3D conformations with Uni-Mol+. Nature Communications 15(1), 7104 (2024). https://doi.org/10.1038/s41467-024-51321-w
  21. OpenAI: GPT-4o. https://openai.com/index/hello-gpt-4o/ (2024)
  22. Oquab, M., Darcet, T., Moutakanni, T., Vo, H.V., Szafraniec, M., Khalidov, V., Fernandez, P., Haziza, D., Massa, F., El-Nouby, A., et al.: DINOv2: Learning robust visual features without supervision. Transactions on Machine Learning Research (TMLR) (2024)
  23. Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., Sutskever, I.: Learning transferable visual models from natural language supervision. In: Proceedings of the 38th International Conference on Machine Learning, ICML 2021. Proceedings of Machine Learning Research, vol. 139. PMLR (2021)
  24. Ramakrishnan, R., Dral, P.O., Rupp, M., von Lilienfeld, O.A.: Quantum chemistry structures and properties of 134 kilo molecules. Scientific Data 1, 140022 (2014)
  25. Rampášek, L., Galkin, M., Dwivedi, V.P., Lim, A.T., Wolf, G., Beaini, D.: Recipe for a general, powerful, scalable graph transformer. In: Advances in Neural Information Processing Systems (NeurIPS), vol. 35, pp. 14501–14515 (2022)
  26. Rong, Y., Bian, Y., Xu, T., Xie, W., Wei, Y., Huang, W., Huang, J.: Self-supervised graph transformer on large-scale molecular data. In: Advances in Neural Information Processing Systems (NeurIPS), vol. 33, pp. 12559–12571 (2020)
  27. Selvaraju, R.R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., Batra, D.: Grad-CAM: Visual explanations from deep networks via gradient-based localization. In: IEEE International Conference on Computer Vision (ICCV), pp. 618–626 (2017)
  28. Soga, P., Lei, Z., Bilodeau, C., Li, J.: Deep interactions for multimodal molecular property prediction. In: NeurIPS 2024 Workshop on AI for New Drug Modalities (2024)
  29. Stärk, H., Beaini, D., Corso, G., Tossou, P., Dallago, C., Günnemann, S., Liò, P.: 3D Infomax improves GNNs for molecular property prediction. In: International Conference on Machine Learning (ICML), pp. 20479–20502 (2022)
  30. Sun, Q., Fang, Y., Wu, L., Wang, X., Cao, Y.: EVA-CLIP: Improved training techniques for CLIP at scale. arXiv preprint arXiv:2303.15389 (2023)
  31. Wu, Z., Ramsundar, B., Feinberg, E.N., Gomes, J., Geniesse, C., Pappu, A.S., Leswing, K., Pande, V.: MoleculeNet: A benchmark for molecular machine learning. Chemical Science 9(2), 513–530 (2018)
  32. Xiong, Z., Wang, D., Liu, X., Zhong, F., Wan, X., Li, X., Li, Z., Luo, X., Chen, K., Jiang, H., Zheng, M.: Pushing the boundaries of molecular representation for drug discovery with the graph attention mechanism. Journal of Medicinal Chemistry 63(16), 8749–8760 (2020)
  33. Yang, K., Swanson, K., Jin, W., Coley, C., Eiden, P., Gao, H., Guzman-Perez, A., Hopper, T., Kelley, B., Mathea, M., et al.: Analyzing learned molecular representations for property prediction. Journal of Chemical Information and Modeling 59(8), 3370–3388 (2019)
  34. Yu, Q., Zhang, Y., Ni, Y., Feng, S., Lan, Y., Zhou, H., Liu, J.: Multimodal molecular pretraining via modality blending. In: International Conference on Learning Representations (ICLR) (2024)
  35. Zhai, X., Mustafa, B., Kolesnikov, A., Beyer, L.: Sigmoid loss for language image pre-training. In: IEEE/CVF International Conference on Computer Vision (ICCV), pp. 11941–11952 (2023)
  36. Zhang, B., Zhang, P., Dong, X., Zang, Y., Wang, J.: Long-CLIP: Unlocking the long-text capability of CLIP. In: European Conference on Computer Vision (ECCV) (2024)
  37. Zhao, L., Song, J., Skinner, K.A.: CRKD: Enhanced camera-radar object detection with cross-modality knowledge distillation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2024)
  38. Zhou, G., Gao, Z., Ding, Q., Zheng, H., Xu, H., Wei, Z., Zhang, L., Ke, G.: Uni-Mol: A universal 3D molecular representation learning framework. In: International Conference on Learning Representations (ICLR) (2023)
  39. Zhu, H., Martin, T.M., Ye, L., Sedykh, A., Young, D.M., Tropsha, A.: Quantitative structure–activity relationship modeling of rat acute toxicity by oral exposure. Chemical Research in Toxicology 22(12), 1913–1921 (2009)