pith. machine review for the scientific record. sign in

arxiv: 2605.08659 · v1 · submitted 2026-05-09 · 💻 cs.CE · q-bio.BM

Recognition: 2 theorem links

· Lean Theorem

Pushing Biomolecular Utility-Diversity Frontiers with Supergroup Relative Policy Optimization

Bin Feng, Hao Li, He Cao, Shenghua Gao, Xiangru Tang, Xinwu Ye, Yu Li, Zijing Liu

Pith reviewed 2026-05-12 01:06 UTC · model grok-4.3

classification 💻 cs.CE q-bio.BM
keywords supergroup relative policy optimizationbiomolecular designutility-diversity tradeoffpolicy optimizationde novo designsmall molecule generationprotein design
0
0 comments X

The pith

Supergroup Relative Policy Optimization expands the utility-diversity Pareto frontier for biomolecular generators by directly rewarding set-level diversity.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Supergroup Relative Policy Optimization to solve the tension where reward-driven improvements in biomolecular generators narrow the range of outputs. It samples a supergroup of candidate sets for each condition, compares their diversity, and assigns the group diversity reward to individual samples using leave-one-out contributions before adding the usual utility reward. This construction works independently of the underlying generator, the utility function, and the choice of diversity metric. Experiments across de novo small-molecule design, pocket-based design, and de novo protein design show consistent expansion of the utility-diversity frontier relative to pretrained models and prior GRPO variants. A sympathetic reader would care because the approach offers a general route to keep variety while still optimizing for task-specific performance in generative design.

Core claim

SGRPO constructs rewards from set-level diversity by sampling a supergroup of candidate sets under the same condition, comparing their diversity, and redistributing the group diversity reward to individual rollouts through leave-one-out diversity contributions before combining it with rollout-level utility. This design decouples the method from any particular generator, utility reward, or diversity metric and allows instantiation with different GRPO-style approaches.

What carries the argument

Supergroup sampling with leave-one-out diversity contributions, which converts a set-level diversity score into per-rollout rewards that can be used directly in policy optimization.

If this is right

  • SGRPO achieves the best frontier-level metrics across decoding sweeps relative to pretrained generators, GRPO, and memory-assisted GRPO.
  • Direct set-level diversity rewards remain effective even with small supergroup sizes.
  • The approach helps preserve broader coverage of the generation distribution during post-training.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same supergroup construction could be tested on non-biomolecular generators where diversity is also a set-level property.
  • Optimal supergroup size may vary with task and generator type and could be tuned empirically.
  • Combining SGRPO with other set-level objectives beyond diversity would be a direct next step.

Load-bearing premise

That comparing diversity across supergroups of candidate sets and redistributing via leave-one-out contributions gives a stable, unbiased signal that does not introduce artifacts or depend on the particular diversity metric chosen.

What would settle it

An experiment that replaces the leave-one-out redistribution with uniform random assignment of the group diversity score and finds that the Pareto-frontier improvement over GRPO baselines disappears.

Figures

Figures reproduced from arXiv: 2605.08659 by Bin Feng, Hao Li, He Cao, Shenghua Gao, Xiangru Tang, Xinwu Ye, Yu Li, Zijing Liu.

Figure 1
Figure 1. Figure 1: Overview of SGRPO. For each condition, SGRPO samples a same-condition supergroup, computes rollout-level utility and group-level diversity, compares groups by leave-one-out group-relative diversity, redis￾tributes the diversity signal within each group according to leave-one-out set contributions, centers the composed rewards over the supergroup, and updates the policy with a PPO-style objective and KL reg… view at source ↗
Figure 2
Figure 2. Figure 2: Utility–diversity operating points for de novo small-molecule design, pocket-based small-molecule design, and de novo protein design. Each marker corresponds to one decoding setting and reports the mean utility and diversity over five independent runs; error bars show 95% confidence intervals on both axes. Dashed lines trace the method-specific non-dominated subsets. 5.2 De novo Small-Molecule Design Setup… view at source ↗
Figure 3
Figure 3. Figure 3: Ablation study on GenMol￾based de novo small-molecule generation. Each point reports the mean over five in￾dependent sweeps, and error bars indicate 95% confidence intervals for utility and diversity. Removing the diversity term yields coupled-GRPO, while removing leave-one-out group credit weakens set￾aware credit assignment. We isolate two components of SGRPO on the GenMol-based de novo small-molecule de… view at source ↗
Figure 5
Figure 5. Figure 5: Analysis of diversity-estimator efficiency and group-reward weighting on the GenMol-based [PITH_FULL_IMAGE:figures/full_fig_p009_5.png] view at source ↗
Figure 4
Figure 4. Figure 4: Distribution dynamics during ProGen2 post-training. SGRPO explores multiple clusters early and preserves them, whereas GRPO contracts and Memory-assisted GRPO drift toward a narrow distant region. We study two practical sensitivities of SGRPO on the GenMol-based de novo small-molecule design task: the efficiency of the group-diversity estimator and the choice of group-reward weight λ. In both cases, we tra… view at source ↗
read the original abstract

Biomolecular generators are often adapted with reward feedback to improve task-specific utility, but pushing utility alone can concentrate generation on a narrow family of candidates. Maintaining diversity is difficult because sample diversity is a set-level property. We introduce Supergroup Relative Policy Optimization (SGRPO), a flexible GRPO-style framework that directly constructs rewards from set-level diversity. For each condition, SGRPO samples a supergroup of candidate sets, compares their diversity under the same condition, and redistributes the group diversity reward to individual rollouts through leave-one-out diversity contributions before combining it with rollout-level utility. This design decouples SGRPO from a particular generator, utility reward, or diversity metric, and allows instantiation with different GRPO-style approaches. We evaluate SGRPO on de novo small-molecule design, pocket-based small-molecule design, and de novo protein design, instantiating it with both GRPO and Coupled-GRPO across autoregressive and discrete diffusion generators. Across decoding sweeps, SGRPO expands the utility-diversity Pareto frontier and achieves the best frontier-level metrics relative to pretrained generators, GRPO, and memory-assisted GRPO when applicable. Our analyses further show that direct set-level diversity rewards remain effective with small groups and help preserve broader generation-distribution coverage during post-training. The code is available at https://github.com/IDEA-XL/SGRPO.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces Supergroup Relative Policy Optimization (SGRPO), a GRPO-style RL framework for biomolecular generators that addresses set-level diversity by sampling supergroups of candidate sets per condition, computing group-level diversity scores, redistributing them to individual rollouts via leave-one-out contributions, and combining with per-rollout utility rewards. It instantiates the method with GRPO and Coupled-GRPO on autoregressive and discrete diffusion models for de novo small-molecule design, pocket-based design, and de novo protein design. Across decoding sweeps, SGRPO is reported to expand the utility-diversity Pareto frontier and outperform pretrained generators, standard GRPO, and memory-assisted GRPO on frontier-level metrics while preserving broader distributional coverage; the approach is claimed to remain effective with small supergroups and to be decoupled from specific generators, utility functions, or diversity metrics. Code is released.

Significance. If the central claims hold, SGRPO offers a modular, metric-agnostic way to incorporate set-level diversity signals into policy optimization for generative models in computational biology and chemistry. This directly targets the common failure mode where utility-only RL collapses generation to narrow families. The empirical breadth across three tasks and two generator families, plus explicit code release, would make the contribution practically useful for downstream biomolecular design workflows.

major comments (2)
  1. [Method description and experimental analyses] The leave-one-out redistribution of the supergroup diversity score (full-group diversity minus diversity without the focal rollout) induces strong negative dependence among the per-rollout diversity rewards within each supergroup. This dependence is load-bearing for the combined advantage used in the GRPO-style update, yet the manuscript provides no analysis of the resulting reward variance, gradient stability, or sensitivity to supergroup size and diversity metric choice. The abstract asserts effectiveness with small groups, but without quantitative checks (e.g., variance of the diversity component or ablation on group size), it is unclear whether the reported Pareto improvements are robust or partly artifacts of the correlation structure.
  2. [Experimental results and figures] The central empirical claim—that SGRPO expands the utility-diversity frontier relative to GRPO and memory-assisted GRPO—rests on decoding sweeps and frontier-level metrics, but the abstract and available text give no details on statistical significance testing, exact baseline implementations, or how post-hoc selection of decoding parameters was avoided. Without these, it is difficult to assess whether the reported gains are reproducible or driven by implementation differences.
minor comments (2)
  1. [Method] Notation for supergroup sampling and leave-one-out computation should be formalized with explicit equations to make the redistribution step unambiguous.
  2. [Experiments] The diversity metric(s) used in the reported experiments should be stated explicitly, along with any ablations showing invariance to metric choice.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and have revised the manuscript to incorporate additional analyses and clarifications where the comments identify opportunities for strengthening the presentation.

read point-by-point responses
  1. Referee: [Method description and experimental analyses] The leave-one-out redistribution of the supergroup diversity score (full-group diversity minus diversity without the focal rollout) induces strong negative dependence among the per-rollout diversity rewards within each supergroup. This dependence is load-bearing for the combined advantage used in the GRPO-style update, yet the manuscript provides no analysis of the resulting reward variance, gradient stability, or sensitivity to supergroup size and diversity metric choice. The abstract asserts effectiveness with small groups, but without quantitative checks (e.g., variance of the diversity component or ablation on group size), it is unclear whether the reported Pareto improvements are robust or partly artifacts of the correlation structure.

    Authors: We agree that the leave-one-out construction deliberately introduces negative dependence to isolate marginal contributions, and that this structure merits explicit examination. The manuscript does report that set-level rewards remain effective with small supergroups (via the analyses referenced in the abstract and experimental results), but we acknowledge the absence of direct quantification of reward variance or gradient stability. In the revised manuscript we will add (i) an ablation varying supergroup size (2–8) with reported variance of the diversity reward component across multiple seeds, (ii) a short discussion of how the induced correlation affects advantage estimates, and (iii) sensitivity checks across the diversity metrics used. These additions will clarify that the observed Pareto gains are not artifacts of the correlation structure. revision: yes

  2. Referee: [Experimental results and figures] The central empirical claim—that SGRPO expands the utility-diversity frontier relative to GRPO and memory-assisted GRPO—rests on decoding sweeps and frontier-level metrics, but the abstract and available text give no details on statistical significance testing, exact baseline implementations, or how post-hoc selection of decoding parameters was avoided. Without these, it is difficult to assess whether the reported gains are reproducible or driven by implementation differences.

    Authors: We accept that the current text does not supply sufficient detail on these points. In the revision we will (i) report statistical significance (error bars and, where appropriate, p-values from repeated runs with different random seeds), (ii) provide precise descriptions of baseline implementations including hyperparameter matching and any memory-buffer details, and (iii) explicitly state that the reported frontiers derive from a fixed, exhaustive decoding-parameter grid without post-hoc selection. These clarifications will be added to the experimental section and figure captions. revision: yes

Circularity Check

0 steps flagged

SGRPO introduces independent supergroup sampling and leave-one-out redistribution without reducing to self-definition or fitted inputs

full rationale

The paper defines SGRPO explicitly as a GRPO-style extension that samples supergroups, computes set-level diversity, and redistributes via leave-one-out contributions before adding utility rewards. This is a constructive algorithmic choice presented in the abstract, not derived from prior equations or self-citations in a load-bearing way. No equations or claims in the provided text equate a 'prediction' to a fitted parameter by construction, invoke uniqueness theorems from the same authors, or rename known results. Empirical evaluations on multiple design tasks compare against baselines without circular reduction. The framework remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The approach rests on the domain assumption that set-level diversity is a meaningful and redistributable reward signal that can be decoupled from specific generators and metrics.

axioms (1)
  • domain assumption Set-level diversity can be meaningfully quantified and used as a reward signal in policy optimization.
    Core mechanism of SGRPO relies on comparing diversity across supergroups and redistributing via leave-one-out.

pith-pipeline@v0.9.0 · 5566 in / 1256 out tokens · 54082 ms · 2026-05-12T01:06:16.303088+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Reference graph

Works this paper leans on

57 extracted references · 57 canonical work pages · 3 internal anchors

  1. [2]

    Model-based reinforcement learning for biological sequence design

    Christof Angermueller, David Dohan, David Belanger, Ramya Deshpande, Kevin Murphy, and Lucy Colwell. Model-based reinforcement learning for biological sequence design. In International conference on learning representations, 2019

  2. [3]

    Molgpt: molecular generation using a transformer-decoder model.Journal of chemical information and modeling, 62(9): 2064–2076, 2021

    Viraj Bagal, Rishal Aggarwal, PK Vinod, and U Deva Priyakumar. Molgpt: molecular generation using a transformer-decoder model.Journal of chemical information and modeling, 62(9): 2064–2076, 2021

  3. [4]

    Why is tanimoto index an appropriate choice for fingerprint-based similarity calculations?Journal of cheminformatics, 7(1):20, 2015

    Dávid Bajusz, Anita Rácz, and Károly Héberger. Why is tanimoto index an appropriate choice for fingerprint-based similarity calculations?Journal of cheminformatics, 7(1):20, 2015

  4. [5]

    Quantifying the chemical beauty of drugs.Nature chemistry, 4(2):90–98, 2012

    G Richard Bickerton, Gaia V Paolini, Jérémy Besnard, Sorel Muresan, and Andrew L Hopkins. Quantifying the chemical beauty of drugs.Nature chemistry, 4(2):90–98, 2012

  5. [6]

    Esben Jannik Bjerrum, Christian Margreitter, Thomas Blaschke, Simona Kolarova, and Raquel López-Ríos de Castro. Faster and more diverse de novo molecular optimization with double- loop reinforcement learning using augmented smiles.Journal of Computer-Aided Molecular Design, 37(8):373–394, 2023

  6. [7]

    Memory-assisted reinforcement learning for diverse molecular de novo design.Journal of cheminformatics, 12 (1):68, 2020

    Thomas Blaschke, Ola Engkvist, Jürgen Bajorath, and Hongming Chen. Memory-assisted reinforcement learning for diverse molecular de novo design.Journal of cheminformatics, 12 (1):68, 2020

  7. [8]

    Design by adaptive sampling.arXiv preprint arXiv:1810.03714, 2018

    David H Brookes and Jennifer Listgarten. Design by adaptive sampling.arXiv preprint arXiv:1810.03714, 2018

  8. [9]

    Transformer-based protein generation with regularized latent space optimization.Nature Machine Intelligence, 4(10):840–851, 2022

    Egbert Castro, Abhinav Godavarthi, Julian Rubinfien, Kevin Givechian, Dhananjay Bhaskar, and Smita Krishnaswamy. Transformer-based protein generation with regularized latent space optimization.Nature Machine Intelligence, 4(10):840–851, 2022

  9. [10]

    Curiosity as a self- supervised method to improve exploration in de novo drug design

    Mohamed-Amine Chadi, Hajar Mousannif, and Ahmed Aamouche. Curiosity as a self- supervised method to improve exploration in de novo drug design. In2023 International Conference on Information Technology Research and Innovation (ICITRI), pages 151–156. IEEE, 2023

  10. [11]

    Decomposed direct preference optimization for structure-based drug design.arXiv preprint arXiv:2407.13981, 2024

    Xiwei Cheng, Xiangxin Zhou, Yuwei Yang, Yu Bao, and Quanquan Gu. Decomposed direct preference optimization for structure-based drug design.arXiv preprint arXiv:2407.13981, 2024

  11. [12]

    Robust deep learning–based protein sequence design using proteinmpnn.Science, 378(6615):49–56, 2022

    Justas Dauparas, Ivan Anishchenko, Nathaniel Bennett, Hua Bai, Robert J Ragotte, Lukas F Milles, Basile IM Wicky, Alexis Courbet, Rob J de Haas, Neville Bethel, et al. Robust deep learning–based protein sequence design using proteinmpnn.Science, 378(6615):49–56, 2022

  12. [14]

    Reinforcement learning on structure-conditioned categorical diffusion for protein inverse folding.arXiv preprint arXiv:2410.17173, 2024

    Yasha Ektefaie, Olivia Viessmann, Siddharth Narayanan, Drew Dresser, J Mark Kim, and Armen Mkrtchyan. Reinforcement learning on structure-conditioned categorical diffusion for protein inverse folding.arXiv preprint arXiv:2410.17173, 2024

  13. [15]

    Estimation of synthetic accessibility score of drug-like molecules based on molecular complexity and fragment contributions.Journal of cheminfor- matics, 1(1):8, 2009

    Peter Ertl and Ansgar Schuffenhauer. Estimation of synthetic accessibility score of drug-like molecules based on molecular complexity and fragment contributions.Journal of cheminfor- matics, 1(1):8, 2009

  14. [16]

    Libinvent: reaction-based generative scaffold decoration for in silico library design.Journal of Chemical Information and Modeling, 62(9):2046–2063, 2021

    Vendy Fialková, Jiaxi Zhao, Kostas Papadopoulos, Ola Engkvist, Esben Jannik Bjerrum, Thierry Kogej, and Atanas Patronov. Libinvent: reaction-based generative scaffold decoration for in silico library design.Journal of Chemical Information and Modeling, 62(9):2046–2063, 2021. 11

  15. [17]

    Three-dimensional convolutional neural networks and a cross- docked data set for structure-based drug design.Journal of chemical information and modeling, 60(9):4200–4215, 2020

    Paul G Francoeur, Tomohide Masuda, Jocelyn Sunseri, Andrew Jia, Richard B Iovanisci, Ian Snyder, and David R Koes. Three-dimensional convolutional neural networks and a cross- docked data set for structure-based drug design.Journal of chemical information and modeling, 60(9):4200–4215, 2020

  16. [18]

    Search- ing for high-value molecules using reinforcement learning and transformers.arXiv preprint arXiv:2310.02902, 2023

    Raj Ghugare, Santiago Miret, Adriana Hugessen, Mariano Phielipp, and Glen Berseth. Search- ing for high-value molecules using reinforcement learning and transformers.arXiv preprint arXiv:2310.02902, 2023

  17. [19]

    Automatic chemical design using a data-driven continuous representation of molecules.ACS central science, 4(2):268–276, 2018

    Rafael Gómez-Bombarelli, Jennifer N Wei, David Duvenaud, José Miguel Hernández-Lobato, Benjamín Sánchez-Lengeling, Dennis Sheberla, Jorge Aguilera-Iparraguirre, Timothy D Hirzel, Ryan P Adams, and Alán Aspuru-Guzik. Automatic chemical design using a data-driven continuous representation of molecules.ACS central science, 4(2):268–276, 2018

  18. [20]

    Diffucoder: Understanding and improving masked diffusion models for code generation.arXiv preprint arXiv:2506.20639, 2025

    Shansan Gong, Ruixiang Zhang, Huangjie Zheng, Jiatao Gu, Navdeep Jaitly, Lingpeng Kong, and Yizhe Zhang. Diffucoder: Understanding and improving masked diffusion models for code generation.arXiv preprint arXiv:2506.20639, 2025

  19. [21]

    Constrained bayesian optimization for automatic chemical design using variational autoencoders.Chemical science, 11(2):577–586, 2020

    Ryan-Rhys Griffiths and José Miguel Hernández-Lobato. Constrained bayesian optimization for automatic chemical design using variational autoencoders.Chemical science, 11(2):577–586, 2020

  20. [22]

    Utilizing reinforcement learning for de novo drug design.Machine Learning, 113(7):4811–4843, 2024

    Hampus Gummesson Svensson, Christian Tyrchan, Ola Engkvist, and Morteza Haghir Chehreghani. Utilizing reinforcement learning for de novo drug design.Machine Learning, 113(7):4811–4843, 2024

  21. [23]

    Protein–sol: a web tool for predicting protein solubility from sequence.Bioinformatics, 33(19):3098–3100, 2017

    Max Hebditch, M Alejandro Carballo-Amador, Spyros Charonis, Robin Curtis, and Jim War- wicker. Protein–sol: a web tool for predicting protein solubility from sequence.Bioinformatics, 33(19):3098–3100, 2017

  22. [24]

    Learning inverse folding from millions of predicted structures

    Chloe Hsu, Robert Verkuil, Jason Liu, Zeming Lin, Brian Hie, Tom Sercu, Adam Lerer, and Alexander Rives. Learning inverse folding from millions of predicted structures. InInternational conference on machine learning, pages 8946–8970. PMLR, 2022

  23. [25]

    Hamiltonian diver- sity: effectively measuring molecular diversity by shortest hamiltonian circuits.Journal of Cheminformatics, 16(1):94, 2024

    Xiuyuan Hu, Guoqing Liu, Quanming Yao, Yang Zhao, and Hao Zhang. Hamiltonian diver- sity: effectively measuring molecular diversity by shortest hamiltonian circuits.Journal of Cheminformatics, 16(1):94, 2024

  24. [26]

    Can llms generate diverse molecules? towards alignment with structural diversity.arXiv preprint arXiv:2410.03138, 2024

    Hyosoon Jang, Yunhui Jang, Jaehyung Kim, and Sungsoo Ahn. Can llms generate diverse molecules? towards alignment with structural diversity.arXiv preprint arXiv:2410.03138, 2024

  25. [27]

    Multi-objective molecule generation using interpretable substructures

    Wengong Jin, Regina Barzilay, and Tommi Jaakkola. Multi-objective molecule generation using interpretable substructures. InInternational conference on machine learning, pages 4849–4859. PMLR, 2020

  26. [28]

    Any-property-conditional molecule generation with self-criticism using spanning trees.arXiv preprint arXiv:2407.09357, 2024

    Alexia Jolicoeur-Martineau, Aristide Baratin, Kisoo Kwon, Boris Knyazev, and Yan Zhang. Any-property-conditional molecule generation with self-criticism using spanning trees.arXiv preprint arXiv:2407.09357, 2024

  27. [29]

    Direct steering of de novo molecular generation with descriptor conditional recurrent neural networks.Nature Machine Intelligence, 2(5):254–265, 2020

    Panagiotis-Christos Kotsias, Josep Arús-Pous, Hongming Chen, Ola Engkvist, Christian Tyr- chan, and Esben Jannik Bjerrum. Direct steering of de novo molecular generation with descriptor conditional recurrent neural networks.Nature Machine Intelligence, 2(5):254–265, 2020

  28. [30]

    Advances and 42 challenges in deep generative models for de novo molecule generation.Wiley Interdis- ciplinary Reviews: Computational Molecular Science2019,9, e1395

    Seul Lee, Karsten Kreis, Srimukh Prasad Veccham, Meng Liu, Danny Reidenbach, Yuxing Peng, Saee Paliwal, Weili Nie, and Arash Vahdat. Genmol: A drug discovery generalist with discrete diffusion.arXiv preprint arXiv:2501.06158, 2025

  29. [31]

    CAGenMol: Condition-Aware Diffusion Language Model for Goal-Directed Molecular Generation

    Yanting Li, Zhuoyang Jiang, Enyan Dai, Lei Wang, Wen-Cai Ye, and Li Liu. Cagenmol: Condition-aware diffusion language model for goal-directed molecular generation.arXiv preprint arXiv:2604.11483, 2026

  30. [32]

    Molecular generative model based on conditional variational autoencoder for de novo molecular design.Journal of cheminformatics, 10(1):31, 2018

    Jaechang Lim, Seongok Ryu, Jin Woo Kim, and Woo Youn Kim. Molecular generative model based on conditional variational autoencoder for de novo molecular design.Journal of cheminformatics, 10(1):31, 2018. 12

  31. [33]

    Evolutionary-scale prediction of atomic-level protein structure with a language model.Science, 379(6637):1123–1130, 2023

    Zeming Lin, Halil Akin, Roshan Rao, Brian Hie, Zhongkai Zhu, Wenting Lu, Nikita Smetanin, Robert Verkuil, Ori Kabeli, Yaniv Shmueli, et al. Evolutionary-scale prediction of atomic-level protein structure with a language model.Science, 379(6637):1123–1130, 2023

  32. [34]

    Xuhan Liu, Kai Ye, Herman WT Van Vlijmen, Adriaan P IJzerman, and Gerard JP Van Westen. An exploration strategy improves the diversity of de novo ligands using deep reinforcement learning: a case for the adenosine a2a receptor.Journal of cheminformatics, 11(1):35, 2019

  33. [35]

    Drugex v3: scaffold-constrained drug design with graph transformer-based reinforcement learning.Journal of Cheminformatics, 15(1):24, 2023

    Xuhan Liu, Kai Ye, Herman WT van Vlijmen, Adriaan P IJzerman, and Gerard JP van Westen. Drugex v3: scaffold-constrained drug design with graph transformer-based reinforcement learning.Journal of Cheminformatics, 15(1):24, 2023

  34. [36]

    Reinvent 4: Modern ai–driven generative molecule design.Journal of Cheminformatics, 16(1):20, 2024

    Hannes H Loeffler, Jiazhen He, Alessandro Tibo, Jon Paul Janet, Alexey V oronov, Lewis H Mervin, and Ola Engkvist. Reinvent 4: Modern ai–driven generative molecule design.Journal of Cheminformatics, 16(1):20, 2024

  35. [37]

    Progen2: exploring the boundaries of protein language models.Cell systems, 14(11):968–978, 2023

    Erik Nijkamp, Jeffrey A Ruffolo, Eli N Weinstein, Nikhil Naik, and Ali Madani. Progen2: exploring the boundaries of protein language models.Cell systems, 14(11):968–978, 2023

  36. [38]

    Gotta be safe: a new framework for molecular design.Digital Discovery, 3(4):796–804, 2024

    Emmanuel Noutahi, Cristian Gabellini, Michael Craig, Jonathan SC Lim, and Prudencio Tossou. Gotta be safe: a new framework for molecular design.Digital Discovery, 3(4):796–804, 2024

  37. [39]

    Molecular de-novo design through deep reinforcement learning.Journal of cheminformatics, 9(1):48, 2017

    Marcus Olivecrona, Thomas Blaschke, Ola Engkvist, and Hongming Chen. Molecular de-novo design through deep reinforcement learning.Journal of cheminformatics, 9(1):48, 2017

  38. [40]

    Jinyeong Park, Jaegyoon Ahn, Jonghwan Choi, and Jibum Kim. Mol-air: Molecular reinforce- ment learning with adaptive intrinsic rewards for goal-directed molecular generation.Journal of Chemical Information and Modeling, 65(5):2283–2296, 2025

  39. [41]

    Improving Inverse Folding for Peptide Design with Diversity-regularized Direct Preference Optimization

    Ryan Park, Darren J Hsu, C Brian Roland, Maria Korshunova, Chen Tessler, Shie Mannor, Olivia Viessmann, and Bruno Trentini. Improving inverse folding for peptide design with diversity-regularized direct preference optimization.arXiv preprint arXiv:2410.19471, 2024

  40. [42]

    Diversity oriented deep reinforcement learning for targeted molecule generation.Journal of cheminformatics, 13(1):21, 2021

    Tiago Pereira, Maryam Abbasi, Bernardete Ribeiro, and Joel P Arrais. Diversity oriented deep reinforcement learning for targeted molecule generation.Journal of cheminformatics, 13(1):21, 2021

  41. [43]

    Deep reinforcement learning for de novo drug design.Science advances, 4(7):eaap7885, 2018

    Mariya Popova, Olexandr Isayev, and Alexander Tropsha. Deep reinforcement learning for de novo drug design.Science advances, 4(7):eaap7885, 2018

  42. [44]

    Temberture: advancing protein ther- mostability prediction with deep learning and attention mechanisms.Bioinformatics Advances, 4(1):vbae103, 2024

    Chiara Rodella, Symela Lazaridi, and Thomas Lemmin. Temberture: advancing protein ther- mostability prediction with deep learning and attention mechanisms.Bioinformatics Advances, 4(1):vbae103, 2024

  43. [45]

    Silvr: guided diffusion for molecule generation

    Nicholas T Runcie and Antonia SJS Mey. Silvr: guided diffusion for molecule generation. Journal of chemical information and modeling, 63(19):5996–6005, 2023

  44. [46]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024

  45. [47]

    Sequential posterior sampling with diffusion models

    Tristan SW Stevens, Oisín Nolan, Jean-Luc Robert, and Ruud JG Van Sloun. Sequential posterior sampling with diffusion models. InICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5. IEEE, 2025

  46. [48]

    Diversity-aware reinforcement learning for de novo drug design.arXiv preprint arXiv:2410.10431, 2024

    Hampus Gummesson Svensson, Christian Tyrchan, Ola Engkvist, and Morteza Haghir Chehreghani. Diversity-aware reinforcement learning for de novo drug design.arXiv preprint arXiv:2410.10431, 2024

  47. [49]

    Diverse mini-batch selection in reinforcement learning for efficient chemical exploration in de novo drug design.arXiv preprint arXiv:2506.21158, 2025

    Hampus Gummesson Svensson, Ola Engkvist, Jon Paul Janet, Christian Tyrchan, and Morteza Haghir Chehreghani. Diverse mini-batch selection in reinforcement learning for efficient chemical exploration in de novo drug design.arXiv preprint arXiv:2506.21158, 2025. 13

  48. [50]

    Augmented hill-climb increases reinforcement learning efficiency for language-based de novo molecule generation

    Morgan Thomas, Noel M O’Boyle, Andreas Bender, and Chris De Graaf. Augmented hill-climb increases reinforcement learning efficiency for language-based de novo molecule generation. Journal of cheminformatics, 14(1):68, 2022

  49. [51]

    Sample-efficient optimiza- tion in the latent space of deep generative models via weighted retraining.Advances in Neural Information Processing Systems, 33:11259–11272, 2020

    Austin Tripp, Erik Daxberger, and José Miguel Hernández-Lobato. Sample-efficient optimiza- tion in the latent space of deep generative models via weighted retraining.Advances in Neural Information Processing Systems, 33:11259–11272, 2020

  50. [52]

    Oleg Trott and Arthur J Olson. Autodock vina: improving the speed and accuracy of docking with a new scoring function, efficient optimization, and multithreading.Journal of computational chemistry, 31(2):455–461, 2010

  51. [53]

    Pro- teinzero: Self-improving protein generation via online reinforcement learning.arXiv preprint arXiv:2506.07459, 2025

    Ziwen Wang, Jiajun Fan, Ruihan Guo, Thao Nguyen, Heng Ji, and Ge Liu. Pro- teinzero: Self-improving protein generation via online reinforcement learning.arXiv preprint arXiv:2506.07459, 2025

  52. [54]

    Aligning protein generative models with experimental fitness via direct preference optimization.bioRxiv, pages 2024–05, 2024

    Talal Widatalla, Rafael Rafailov, and Brian Hie. Aligning protein generative models with experimental fitness via direct preference optimization.bioRxiv, pages 2024–05, 2024

  53. [55]

    Proteinguide: On-the-fly property guidance for protein sequence generative models.arXiv preprint arXiv:2505.04823, 2025

    Junhao Xiong, Ishan Gaur, Maria Lukarska, Hunter Nisonoff, Luke M Oltrogge, David F Savage, and Jennifer Listgarten. Proteinguide: On-the-fly property guidance for protein sequence generative models.arXiv preprint arXiv:2505.04823, 2025

  54. [56]

    Hit and lead discovery with explorative rl and fragment-based molecule generation.Advances in Neural Information Processing Systems, 34:7924–7936, 2021

    Soojung Yang, Doyeong Hwang, Seul Lee, Seongok Ryu, and Sung Ju Hwang. Hit and lead discovery with explorative rl and fragment-based molecule generation.Advances in Neural Information Processing Systems, 34:7924–7936, 2021

  55. [57]

    Graph convolutional policy network for goal-directed molecular graph generation.Advances in neural information processing systems, 31, 2018

    Jiaxuan You, Bowen Liu, Zhitao Ying, Vijay Pande, and Jure Leskovec. Graph convolutional policy network for goal-directed molecular graph generation.Advances in neural information processing systems, 31, 2018

  56. [58]

    Optimization of molecules via deep reinforcement learning.Scientific reports, 9(1):10752, 2019

    Zhenpeng Zhou, Steven Kearnes, Li Li, Richard N Zare, and Patrick Riley. Optimization of molecules via deep reinforcement learning.Scientific reports, 9(1):10752, 2019

  57. [59]

    Scaffold-driven molecular generation via rein- forced rnn with centroid distance evaluation.Expert Systems with Applications, 292:128606, 2025

    Xingzheng Zhu, Zhihong Zhao, and Fei Zhu. Scaffold-driven molecular generation via rein- forced rnn with centroid distance evaluation.Expert Systems with Applications, 292:128606, 2025. 14 A Full Training Procedure of SGRPO This appendix provides the full training procedure of Supergroup Relative Policy Optimization (SGRPO), corresponding to Section 4 in ...