De novo molecular generation with optical property preconditioning at the token level

Alán Aspuru-Guzik; Haozhe Huang; Hyun Suk Park; Jorge A. Campos-Gonzalez-Angulo; Manuel Gonzalez Lastre; Xinjian Liu

arxiv: 2606.08221 · v1 · pith:S2SJUVB4new · submitted 2026-06-06 · 💻 cs.LG

Haozhe Huang, Manuel Gonzalez Lastre, Hyun Suk Park, Jorge A. Campos-Gonzalez-Angulo, Xinjian Liu, Alán Aspuru-Guzik

Pith reviewed 2026-06-27 20:07 UTC · model grok-4.3

classification 💻 cs.LG

keywords de novo molecular generation · OLED design · token conditioning · autoregressive language model · optical properties · TDDFT evaluation · chemotype analysis · property control

The pith

Token conditioning in a language model directs optical properties of generated OLED molecules, but controllability varies by chemical motif.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper benchmarks a token-conditioned autoregressive language model for generating molecules with targeted vertical absorption energy and oscillator strength in a low-data OLED design setting. It pretrains GPT2 on chemical corpora, augments it with discrete property tokens, and fine-tunes via multi-task optimization before evaluating outputs at the TDDFT level. A sympathetic reader would care because the work quantifies directional control at the token level while revealing that aggregate property distributions alone miss important failures tied to specific electronic environments. The central finding is that control works consistently in direction across bins yet shows local irregularities and strong dependence on motifs such as aromatic carbons versus electron-withdrawing groups.

Core claim

A GPT2 model pretrained on chemical corpora, then fine-tuned with discrete tokens for absorption energy, oscillator strength, and an auxiliary HOMO-LUMO gap, generates molecules whose TDDFT properties reproduce the dominant support of the training distribution while shifting toward lower molecular weight. Token-level control is consistently directional across conditioning bins, though it is not fully orthogonal and exhibits local calibration irregularities. Controllability improves for moderately conjugated aromatic-carbon motifs and degrades for electron-withdrawing motifs such as aryl nitriles.

What carries the argument

Discrete property tokens inserted into a pretrained autoregressive language model to enable token-level conditioning on vertical absorption energy and oscillator strength during multi-task fine-tuning.
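
To make the idea of discrete property tokens concrete, here is a minimal sketch of how a continuous property value could be mapped to a bin token and prepended to a SMILES string. The token names (`<E_abs_k>`, `<f_osc_k>`), the bin edges, and the prepend-to-sequence layout are illustrative assumptions; the abstract does not specify the paper's actual tokenization.

```python
from bisect import bisect_right

# Hypothetical interior bin boundaries (assumptions, not taken from the paper).
E_EDGES = [2.5, 3.0, 3.5, 4.0]   # vertical absorption energy, eV
F_EDGES = [0.1, 0.5, 1.0, 1.5]   # oscillator strength, dimensionless


def property_token(name, value, edges):
    """Map a continuous value to a discrete token such as '<E_abs_2>'.

    `edges` are sorted interior boundaries, so len(edges) + 1 bins exist.
    """
    idx = bisect_right(edges, value)
    return f"<{name}_{idx}>"


def build_conditioned_sequence(smiles, e_abs_ev, f_osc):
    """Prepend property tokens to a SMILES string (hypothetical layout)."""
    tokens = [
        property_token("E_abs", e_abs_ev, E_EDGES),
        property_token("f_osc", f_osc, F_EDGES),
    ]
    return " ".join(tokens) + " " + smiles


example = build_conditioned_sequence("c1ccccc1", e_abs_ev=3.2, f_osc=0.05)
print(example)
```

At generation time, a model trained on such sequences would be prompted with the desired tokens only and asked to complete the molecule.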

If this is right

The generated library reproduces the training optical-property support while favoring smaller molecules with fewer heavy atoms.
Token-level control is directional across bins but not fully orthogonal and shows local calibration irregularities.
Controllability improves for moderately conjugated aromatic-carbon motifs and degrades for electron-withdrawing motifs such as aryl nitriles.
Reliability assessment requires chemotype-resolved analysis rather than aggregate distributions alone.
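
The notions of "directional" and "not fully orthogonal" control can be illustrated with a small diagnostic. The sketch below assumes each generated molecule is stored as a dict holding its conditioning bins and its evaluated properties; the field names, the monotonicity criterion, and the cross-talk measure are assumptions chosen for illustration, not the paper's metrics, and the records are toy values.

```python
from collections import defaultdict
from statistics import mean


def mean_by_bin(records, bin_key, value_key):
    """Average an evaluated property within each conditioning bin."""
    groups = defaultdict(list)
    for r in records:
        groups[r[bin_key]].append(r[value_key])
    return {b: mean(v) for b, v in sorted(groups.items())}


def is_directional(bin_means):
    """True if the per-bin means never decrease as the bin index increases."""
    vals = [bin_means[b] for b in sorted(bin_means)]
    return all(a <= b for a, b in zip(vals, vals[1:]))


def cross_talk(records, bin_key, other_value_key):
    """Spread of the *other* property's mean across bins of this token.

    Zero would indicate perfectly orthogonal control under this toy measure.
    """
    means = mean_by_bin(records, bin_key, other_value_key)
    return max(means.values()) - min(means.values())


# Toy records (invented for illustration only).
records = [
    {"e_bin": 0, "f_bin": 0, "e": 2.4, "f": 0.2},
    {"e_bin": 0, "f_bin": 1, "e": 2.6, "f": 0.8},
    {"e_bin": 1, "f_bin": 0, "e": 3.1, "f": 0.3},
    {"e_bin": 1, "f_bin": 1, "e": 3.3, "f": 0.9},
]

e_means = mean_by_bin(records, "e_bin", "e")
directional = is_directional(e_means)
leak = cross_talk(records, "e_bin", "f")
print(e_means, directional, leak)
```

In this toy set the energy token moves the energy in the intended direction, yet the oscillator strength also drifts with the energy bin, which is the kind of coupling the "not fully orthogonal" finding describes.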

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Future generative models may require motif-aware or environment-specific conditioning to improve performance on electron-withdrawing groups.
The same token-preconditioning approach could be tested on other sparse-data molecular design tasks such as drug-like property targeting.
Higher-level electronic-structure methods or direct experimental feedback loops would be needed to move beyond TDDFT proxies.

Load-bearing premise

TDDFT calculations on the generated molecules serve as a reliable proxy for both distributional fidelity and the success of token-level control.

What would settle it

Measure experimental absorption spectra and oscillator strengths for a set of generated molecules and compare them with the conditioned targets. Systematic deviations beyond the expected error would falsify the controllability claims.
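
As a rough sketch of such a check, measured absorption maxima (usually reported in nm) can be converted to energies with E = hc/λ, where hc ≈ 1239.84 eV·nm, and compared with the target energies through a mean signed error. The tolerance and all numbers below are hypothetical. An experimental absorption maximum is also not strictly identical to a vertical excitation energy, so this is only a first-order comparison.

```python
HC_EV_NM = 1239.84198  # Planck constant times speed of light, in eV*nm


def nm_to_ev(wavelength_nm):
    """Convert a wavelength in nm to a photon energy in eV."""
    return HC_EV_NM / wavelength_nm


def mean_signed_error(measured_ev, target_ev):
    """Average of (measured - target); negative values indicate a red shift."""
    diffs = [m - t for m, t in zip(measured_ev, target_ev)]
    return sum(diffs) / len(diffs)


# Toy values, invented for illustration only.
measured_nm = [400.0, 450.0, 500.0]
target_ev = [3.2, 2.9, 2.6]
TOLERANCE_EV = 0.1  # hypothetical acceptable bias

measured_ev = [nm_to_ev(w) for w in measured_nm]
bias = mean_signed_error(measured_ev, target_ev)
systematic_red_shift = bias < -TOLERANCE_EV
print(round(bias, 4), systematic_red_shift)
```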

Original abstract

Designing OLED molecules with targeted optical properties remains challenging due to the scarcity of high-quality data and the limited reliability of conditional control in generative models across chemical motifs. Here, we benchmark a token-conditioned autoregressive language model for OLED molecular generation in a realistic low-data regime. A GPT2 model is pretrained on large chemical corpora, augmented with discrete property tokens, and fine-tuned using multi-task optimisation. Conditioning targets vertical absorption energy and oscillator strength, with the HOMO-LUMO gap included as an auxiliary electronic descriptor. Generated molecules are evaluated at the TDDFT level to assess distributional fidelity and controllability. The generated library reproduces the dominant optical-property support of the training distribution while shifting towards lower molecular weight and fewer heavy atoms. Token-level control is consistently directional across conditioning bins, but is not fully orthogonal and exhibits local calibration irregularities. A chemotype-resolved analysis further shows that controllability depends strongly on local electronic environments: moderately conjugated aromatic-carbon motifs are associated with improved joint target satisfaction, whereas electron-withdrawing motifs, particularly aryl nitriles, show systematic red-shifting and reduced controllability. These results establish a quantitative benchmark for conditional OLED molecular generation and show that model reliability must be assessed in chemically meaningful subspaces rather than from aggregate property distributions alone.
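
The chemotype-resolved analysis described in the abstract can be sketched as a grouped success rate. The example below assumes motif labels have already been assigned to each molecule (in practice one might use substructure matching, which is omitted here) and defines "joint target satisfaction" as both evaluated properties landing in their requested bins. That definition, the field names, and the records are assumptions for illustration, not the paper's data.

```python
from collections import defaultdict


def is_joint_hit(record):
    """Both properties fall in the bins that were requested (assumed criterion)."""
    return (record["e_target"] == record["e_tddft"]
            and record["f_target"] == record["f_tddft"])


def joint_satisfaction_by_motif(records):
    """Fraction of joint hits among molecules containing each motif."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for r in records:
        hit = int(is_joint_hit(r))
        for motif in r["motifs"]:
            totals[motif] += 1
            hits[motif] += hit
    return {m: hits[m] / totals[m] for m in totals}


# Toy records with pre-assigned motif labels (invented for illustration).
records = [
    {"motifs": ["aromatic_carbon"], "e_target": 2, "e_tddft": 2, "f_target": 1, "f_tddft": 1},
    {"motifs": ["aromatic_carbon"], "e_target": 3, "e_tddft": 3, "f_target": 0, "f_tddft": 0},
    {"motifs": ["aryl_nitrile"], "e_target": 3, "e_tddft": 2, "f_target": 1, "f_tddft": 1},
    {"motifs": ["aryl_nitrile", "aromatic_carbon"], "e_target": 2, "e_tddft": 1, "f_target": 2, "f_tddft": 2},
]

rates = joint_satisfaction_by_motif(records)
aggregate = sum(is_joint_hit(r) for r in records) / len(records)
print(rates, aggregate)
```

The aggregate rate in this toy set sits between the two motif-level rates, which shows how a single library-wide number can mask a motif on which control fails.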

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Token-conditioned GPT2 delivers directional control over OLED absorption and oscillator strength with motif breakdowns, but everything rests on TDDFT evaluation.

The main takeaway is that this paper takes a standard GPT2 setup with discrete property tokens, fine-tunes it on limited OLED data, and shows the conditioning moves generated molecules in the intended direction for vertical absorption energy and oscillator strength. They add a chemotype breakdown that links better joint target satisfaction to moderately conjugated aromatic motifs and worse performance plus red-shifting to electron-withdrawing groups like aryl nitriles.

What stands out is the concrete benchmark in a realistic low-data regime plus the explicit check that aggregate metrics hide motif-dependent behavior. The directional control holds across bins, and the auxiliary HOMO-LUMO token helps without fully orthogonalizing the targets. That part is useful for people already running similar autoregressive generators on materials.

The soft spot is the exclusive reliance on TDDFT for all fidelity and controllability numbers. TDDFT is known to struggle with charge-transfer states and electron-withdrawing substituents, exactly the motifs where the paper reports the largest irregularities. Without any cross-check against wavefunction methods or experimental spectra, it is hard to tell whether the reported motif effects are real electronic-structure signals or level-of-theory artifacts. The abstract does not mention data-split details or training ablations either, though the full text might clarify those.

This is the kind of targeted extension that computational chemists working on conditional molecular generation will want to see. It is not a foundational advance, but the motif-resolved diagnostics give a practical handle that aggregate benchmarks miss. I would send it to referees; the core claim about needing subspace evaluation is worth checking against the methods and any additional validation they provide.

Referee Report

2 major / 2 minor

Summary. The manuscript benchmarks a GPT2 autoregressive language model pretrained on chemical corpora and fine-tuned with multi-task optimization using discrete token-level conditioning on vertical absorption energy, oscillator strength, and the HOMO-LUMO gap as an auxiliary descriptor. Generated molecules are evaluated exclusively at the TDDFT level for distributional fidelity and controllability. The results show directional token control across conditioning bins (though not fully orthogonal, with local calibration issues) and motif-dependent performance: moderately conjugated aromatic-carbon motifs improve joint target satisfaction while electron-withdrawing motifs (e.g., aryl nitriles) exhibit systematic red-shifting and reduced controllability. The central conclusion is that model reliability must be assessed in chemically meaningful subspaces rather than aggregate distributions alone.

Significance. If the TDDFT proxy evaluations prove reliable, the work supplies a quantitative benchmark for conditional de novo molecular generation in a realistic low-data regime and demonstrates the practical value of chemotype-resolved analysis for identifying subspaces where control is effective. This could inform more robust deployment of generative models in materials applications such as OLED design.

major comments (2)

[Abstract and chemotype-resolved analysis] All controllability metrics, directional claims, and motif-specific findings (e.g., red-shifting for aryl nitriles, improved joint satisfaction for moderately conjugated aromatics) are derived solely from TDDFT-computed vertical excitations and oscillator strengths. TDDFT is known to exhibit systematic errors for charge-transfer states and electron-withdrawing groups; without cross-checks against wavefunction methods (ADC(2), CC2, CASPT2) or experimental spectra, the reported motif-dependent irregularities cannot be distinguished from level-of-theory artifacts, directly undermining the load-bearing claim that reliability must be assessed in chemically meaningful subspaces.
[Results on token-level control] The abstract reports that control is 'consistently directional across conditioning bins' yet 'not fully orthogonal' with 'local calibration irregularities.' No details are provided on data splits, how conditioning bins are constructed, or whether post-hoc motif analysis influences the metrics; this leaves open the possibility of selection effects that would affect the orthogonality and calibration conclusions.

minor comments (2)

The abstract would benefit from explicit mention of training corpus size, fine-tuning dataset size, and the precise discretization scheme for the property tokens to allow immediate assessment of the low-data regime.
Notation for the conditioning tokens and the multi-task loss could be clarified with a short equation or table in the methods section for readers outside the immediate subfield.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for these constructive comments, which correctly identify key limitations in our evaluation strategy. We respond to each major comment below and indicate planned revisions where appropriate.

Referee: [Abstract and chemotype-resolved analysis] All controllability metrics, directional claims, and motif-specific findings (e.g., red-shifting for aryl nitriles, improved joint satisfaction for moderately conjugated aromatics) are derived solely from TDDFT-computed vertical excitations and oscillator strengths. TDDFT is known to exhibit systematic errors for charge-transfer states and electron-withdrawing groups; without cross-checks against wavefunction methods (ADC(2), CC2, CASPT2) or experimental spectra, the reported motif-dependent irregularities cannot be distinguished from level-of-theory artifacts, directly undermining the load-bearing claim that reliability must be assessed in chemically meaningful subspaces.

Authors: We agree that TDDFT exhibits well-documented systematic errors for charge-transfer states and electron-withdrawing groups, and that our motif-specific findings could in principle reflect level-of-theory artifacts rather than intrinsic model behavior. All results in the manuscript are obtained exclusively at the TDDFT level; no higher-level calculations (ADC(2), CC2, CASPT2) or experimental spectra are available. In revision we will add an explicit limitations paragraph that restricts the chemotype-resolved claims to the TDDFT level of theory and notes that subspace analysis remains useful even within a single level of theory. We cannot rule out artifacts without additional computations.
Revision: partial

Referee: [Results on token-level control] The abstract reports that control is 'consistently directional across conditioning bins' yet 'not fully orthogonal' with 'local calibration irregularities.' No details are provided on data splits, how conditioning bins are constructed, or whether post-hoc motif analysis influences the metrics; this leaves open the possibility of selection effects that would affect the orthogonality and calibration conclusions.

Authors: We will expand the methods section to specify that the dataset was randomly partitioned 80/10/10 (train/validation/test), that conditioning bins were defined as five equal-width intervals spanning the training-set property ranges, and that motif analysis was performed after generation and TDDFT evaluation on the complete generated set. Controllability metrics were computed on the full set prior to any motif stratification, so post-hoc analysis did not influence the reported orthogonality or calibration results. These details will be added in revision.
Revision: yes
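
The split and binning scheme described in this simulated response can be sketched as follows. The function names, the random seed, and the handling of out-of-range values are assumptions; only the 80/10/10 proportions and the five equal-width bins over the training range come from the response text, which is itself simulated rather than taken from the paper.

```python
import random


def random_split(items, fractions=(0.8, 0.1, 0.1), seed=0):
    """Shuffle and partition items into train/validation/test subsets."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = int(fractions[0] * len(shuffled))
    n_val = int(fractions[1] * len(shuffled))
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])


def equal_width_edges(train_values, n_bins=5):
    """Bin edges spanning the training-set range with equal widths."""
    lo, hi = min(train_values), max(train_values)
    width = (hi - lo) / n_bins
    return [lo + i * width for i in range(n_bins + 1)]


def bin_index(value, edges):
    """Index of the bin containing value; out-of-range values are clamped."""
    n_bins = len(edges) - 1
    width = (edges[-1] - edges[0]) / n_bins
    if width == 0 or value <= edges[0]:
        return 0
    if value >= edges[-1]:
        return n_bins - 1
    return min(int((value - edges[0]) / width), n_bins - 1)


train, val, test = random_split(range(100))
edges = equal_width_edges([2.0, 3.1, 4.5])  # toy energies in eV
print(len(train), len(val), len(test), edges, bin_index(3.2, edges))
```

Deriving the edges from the training subset only, as in this sketch, keeps validation and test molecules from influencing the bin definitions.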

standing simulated objections not resolved

Absence of higher-level wavefunction or experimental validation for the TDDFT-derived motif dependencies; such calculations are computationally prohibitive at the scale of the generated library.

Circularity Check

0 steps flagged

No significant circularity detected

The paper's central claims rest on token-conditioned autoregressive generation followed by independent TDDFT evaluation of generated molecules for absorption energy, oscillator strength, and motif-specific controllability. These evaluation quantities are computed externally and are not defined by or reduced to the conditioning tokens or model inputs. No self-definitional steps, fitted-input predictions, or load-bearing self-citations appear in the derivation chain. The work is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only view yields no explicit free parameters, axioms, or invented entities; the approach implicitly assumes that discrete property tokens can be learned as effective conditioning signals and that TDDFT is an adequate evaluator, but these are not enumerated as new postulates.

pith-pipeline@v0.9.1-grok · 5778 in / 1136 out tokens · 18170 ms · 2026-06-27T20:07:32.405270+00:00 · methodology


[1] [1]

Hirzel, David Duvenaud, Dougal Maclaurin, Martin A

Rafael Gómez-Bombarelli, Jorge Aguilera-Iparraguirre, Timothy D. Hirzel, David Duvenaud, Dougal Maclaurin, Martin A. Blood-Forsythe, Hyun Sik Chae, Markus Einzinger, Dong-Gwang Ha, Tony Wu, Georgios Markopoulos, Soonok Jeon, Hosuk Kang, Hiroshi Miyazaki, Masaki Numata, Sunghan Kim, Wenliang Huang, Seong Ik Hong, Marc Baldo, Ryan P. Adams, and Alán Aspuru-...

2016

[2] [2]

Fluorescence and phosphorescence from higher excited states of organic molecules.Chemical Reviews, 12(8):4541–4568, May 2012

Takao Itoh. Fluorescence and phosphorescence from higher excited states of organic molecules.Chemical Reviews, 12(8):4541–4568, May 2012

2012

[3] [3]

Greenman, William H

Kevin P. Greenman, William H. Green, and Rafael Gómez-Bombarelli. Multi-fidelity prediction of molecular optical peaks with deep learning.Chem. Sci., 13(4):1152–1162, 2022

2022

[4] [4]

Son Gyo Jung, Guwon Jung, and Jacqueline M. Cole. Automatic Prediction of Peak Optical Absorption Wavelengths in Molecules Using Convolutional Neural Networks.Journal of Chemical Information and Modeling, 64(5):1486–1501, March 2024

2024

[5] [5]

Greenman, Yunsie Chung, Shih-Cheng Li, David E

Esther Heid, Kevin P. Greenman, Yunsie Chung, Shih-Cheng Li, David E. Graff, Florence H. Vermeire, Haoyang Wu, William H. Green, and Charles J. McGill. Chemprop: A Machine Learning Package for Chemical Property Prediction.Journal of Chemical Information and Modeling, 64(1):9–17, January 2024

2024

[6] Rubens C. Souza, Julio Cesar Duarte, Ronaldo R. Goldschmidt, and Itamar Borges Jr. Predicting Fluorescence Emission Wavelengths and Quantum Yields via Machine Learning, February 2025.

[7] Han Altae-Tran, Bharath Ramsundar, Aneesh S. Pappu, and Vijay Pande. Low Data Drug Discovery with One-Shot Learning. ACS Central Science, 3(4):283–293, April 2017.

[8] Rafael Gómez-Bombarelli, Jennifer N Wei, David Duvenaud, José Miguel Hernández-Lobato, Benjamín Sánchez-Lengeling, Dennis Sheberla, Jorge Aguilera-Iparraguirre, Timothy D Hirzel, Ryan P Adams, and Alán Aspuru-Guzik. Automatic chemical design using a data-driven continuous representation of molecules. ACS Central Science, 4(2):268–276, 2018.

[9] Minjian Yang, Hanyu Sun, Xue Liu, Xi Xue, Yafeng Deng, and Xiaojian Wang. CMGN: a conditional molecular generation net to design target-specific molecules with desired properties. Briefings in Bioinformatics, 24(4):bbad185, July 2023.

[10] Jannis Born and Matteo Manica. Regression Transformer enables concurrent sequence regression and generation for molecular language modelling. Nature Machine Intelligence, 5(4):432–444, April 2023.

[11] Viraj Bagal, Rishal Aggarwal, P. K. Vinod, and U. Deva Priyakumar. MolGPT: Molecular Generation Using a Transformer-Decoder Model. J. Chem. Inf. Model., 62(9):2064–2076, May 2022.

[12] Zhaowei Zhang, Jose Cedric Fernandez Soria, and Alex Zhavoronkov. LlaMol: a dynamic multi-conditional generative transformer language model for de novo molecular design. Journal of Cheminformatics, 16(1):73, 2024.

[13] Jike Wang, Rui Qin, Mingyang Wang, Meijing Fang, Yangyang Zhang, Yuchen Zhu, Qun Su, Qiaolin Gou, Chao Shen, Odin Zhang, Zhenxing Wu, Dejun Jiang, Xujun Zhang, Huifeng Zhao, Jingxuan Ge, Zhourui Wu, Yu Kang, Chang-Yu Hsieh, and Tingjun Hou. Token-Mol: target-guided molecular generation and optimization with limited data and chemical feedback. Nature Communications, 16(1):4416, 2025.

[14] Austin H. Cheng, Chong Sun, and Alán Aspuru-Guzik. Scalable Autoregressive 3D Molecule Generation. arXiv preprint arXiv:2505.13791, May 2025.

[15] Yizhen Zheng, Huan Yee Koh, Jiaxin Ju, Anh T. N. Nguyen, Lauren T. May, Geoffrey I. Webb, and Shirui Pan. Large language models for scientific discovery in molecular property prediction. Nature Machine Intelligence, pages 1–11, February 2025.

[16] Xingyao Niu, Yuanyuan Zhang, Dongge Ma, Ruixin Huang, Ye Yuan, Xiangbao Dong, Menghan Li, Xueliang Lu, and Dan Wei. A data-driven OLED candidate generation and optimization framework integrating machine learning, quantum chemistry simulation, and synthesis validation. Journal of Materials Informatics, 5(4):45, 2025.

[17] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. OpenAI, 2019.

[18] Philippe Schwaller, Teodoro Laino, Théophile Gaudin, Peter Bolgar, Christopher A. Hunter, Costas Bekas, and Alpha A. Lee. Molecular Transformer: A Model for Uncertainty-Calibrated Chemical Reaction Prediction. ACS Central Science, 5(9):1572–1583, September 2019.

[19] Anna Gaulton, Louisa J. Bellis, A. Patricia Bento, Jon Chambers, Mark Davies, Anne Hersey, Yvonne Light, Shaun McGlinchey, David Michalovich, Bissan Al-Lazikani, and John P. Overington. ChEMBL: a large-scale bioactivity database for drug discovery. Nucleic Acids Research, 40(Database issue):D1100–D1107, January 2012.

[20] Andrew M Dai and Quoc V Le. Semi-supervised Sequence Learning. In Advances in Neural Information Processing Systems, volume 28. Curran Associates, Inc., 2015.

[21] Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving Language Understanding by Generative Pre-Training, 2018.

[22] Aviv Navon, Aviv Shamsian, Idan Achituve, Haggai Maron, Kenji Kawaguchi, Gal Chechik, and Ethan Fetaya. Multi-Task Learning as a Bargaining Game. In Proceedings of the 39th International Conference on Machine Learning, pages 16428–16446. PMLR, June 2022.

[23] Guillaume Sanchez, Alexander Spangher, Honglu Fan, Elad Levi, Pawan Sasanka Ammanamanchi, and Stella Biderman. Stay on Topic with Classifier-Free Guidance. arXiv preprint arXiv:2306.17806, October 2023.

[24] Martin Stroet, Bertrand Caron, Martin S. Engler, Jimi van der Woning, Aude Kauffmann, Marc van Dijk, Mohammed El-Kebir, Koen M. Visscher, Josef Holownia, Callum Macfarlane, Brian J. Bennion, Svetlana Gelpi-Dominguez, Felice C. Lightstone, Tijs van der Storm, Daan P. Geerke, Alan E. Mark, and Gunnar W. Klau. OFraMP: a fragment-based tool to facilitate the... (2023)

[25] Kevin Yang, Kyle Swanson, Wengong Jin, Connor Coley, Philipp Eiden, Hua Gao, Angel Guzman-Perez, Timothy Hopper, Brian Kelley, Miriam Mathea, Andrew Palmer, Volker Settels, Tommi Jaakkola, Klavs Jensen, and Regina Barzilay. Analyzing Learned Molecular Representations for Property Prediction. Journal of Chemical Information and Modeling, 59(8):3370–3388, August 2019.

[26] Edward J. Beard, Ganesh Sivaraman, Álvaro Vázquez-Mayagoitia, Venkatram Vishwanath, and Jacqueline M. Cole. Comparative dataset of experimental and computational attributes of UV/vis absorption spectra. Scientific Data, 6(1):307, December 2019.

[27] Cheng-Wei Ju, Hanzhi Bai, Bo Li, and Rizhang Liu. Machine Learning Enables Highly Accurate Predictions of Photophysical Properties of Organic Fluorescent Materials: Emission Wavelengths and Quantum Yields. Journal of Chemical Information and Modeling, 61(3):1053–1065, March 2021.

[28] Joonyoung F. Joung, Minhi Han, Minseok Jeong, and Sungnam Park. Experimental database of optical properties of organic compounds. Scientific Data, 7(1):295, September 2020.

[29] Vishwesh Venkatraman, Rajesh Raju, Solon P. Oikonomopoulos, and Bjørn K. Alsberg. The dye-sensitized solar cell database. Journal of Cheminformatics, 10(1):18, April 2018.

[30] Vishwesh Venkatraman and Lethesh Kallidanthiyil Chellappan. An Open Access Data Set Highlighting Aggregation of Dyes on Metal Oxides. Data, 5(2):45, June 2020.

[31] Naruki Yoshikawa and Geoffrey R. Hutchison. Fast, efficient fragment-based coordinate generation for Open Babel. Journal of Cheminformatics, 11(1):1–9, December 2019.

[32] Philipp Pracht, Fabian Bohle, and Stefan Grimme. Automated exploration of the low-energy chemical space with fast quantum chemical methods. Physical Chemistry Chemical Physics, 22(14):7169–7192, April 2020.

[33] Christoph Bannwarth, Sebastian Ehlert, and Stefan Grimme. GFN2-xTB—an accurate and broadly parametrized self-consistent tight-binding quantum chemical method with multipole electrostatics and density-dependent dispersion contributions. Journal of Chemical Theory and Computation, 15(3):1652–1671, March 2019.

[34] Jayashree Srinivasan, Thomas E. Cheatham, Piotr Cieplak, Peter A. Kollman, and David A. Case. Continuum Solvent Studies of the Stability of DNA, RNA, and Phosphoramidate−DNA Helices. Journal of the American Chemical Society, 120(37):9401–9409, September 1998.

[35] Frank Neese. Software update: The ORCA program system—version 6.0. WIREs Computational Molecular Science, 15(2):e70019, 2025. doi:10.1002/wcms.70019.

[36] Jan Gerit Brandenburg, Christoph Bannwarth, Andreas Hansen, and Stefan Grimme. B97-3c: A revised low-cost variant of the B97-D density functional method. Journal of Chemical Physics, 148(6):064104, February 2018.

[37] Frank Neese, Frank Wennmohs, Andreas Hansen, and Ute Becker. Efficient, approximate and parallel Hartree–Fock and hybrid DFT calculations. A ‘chain-of-spheres’ algorithm for the Hartree–Fock exchange. Chemical Physics, 356(1):98–109, February 2009.

[38] Florian Weigend and Reinhart Ahlrichs. Balanced basis sets of split valence, triple zeta valence and quadruple zeta valence quality for H to Rn: Design and assessment of accuracy. Physical Chemistry Chemical Physics, 7(18):3297–3305, August 2005.

[39] Florian Weigend. Accurate Coulomb-fitting basis sets for H to Rn. Physical Chemistry Chemical Physics, 8(9):1057–1065, February 2006.

[40] Jeng-Da Chai and Martin Head-Gordon. Long-range corrected hybrid density functionals with damped atom–atom dispersion corrections. Physical Chemistry Chemical Physics, 10(44):6615–6620, November 2008.

[41] You-Sheng Lin, Guan-De Li, Shan-Ping Mao, and Jeng-Da Chai. Long-range corrected hybrid density functionals with improved dispersion corrections. Journal of Chemical Theory and Computation, 9(1):263–272, January 2013.

[42] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, May 2019. arXiv:1810.04805 [cs].

[43] Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Russ R Salakhutdinov, and Quoc V Le. XLNet: Generalized Autoregressive Pretraining for Language Understanding. In Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019.

[44] Julia Grosse, Ruotian Wu, Ahmad Rashid, Philipp Hennig, Pascal Poupart, and Agustinus Kristiadi. Uncertainty-Guided Optimization on Large Language Model Search Trees, October 2024. arXiv:2407.03951 [cs].

[45] Ilya Loshchilov and Frank Hutter. Decoupled Weight Decay Regularization. arXiv preprint arXiv:1711.05101, November 2017.

[46] Peter Ertl and Ansgar Schuffenhauer. Estimation of synthetic accessibility score of drug-like molecules based on molecular complexity and fragment contributions. Journal of Cheminformatics, 1(1):8, June 2009.

Supporting Information

S0.1 Supplementary material index

Contents overview. The list below summarizes the sections and major items included in the ...